BIDIRECTIONAL LEARNING FOR OFFLINE MODEL-BASED BIOLOGICAL SEQUENCE DESIGN

Anonymous

Abstract

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. In this paper, we focus on biological sequence design to maximize some sequence score. A recent approach employs bidirectional learning, combining a forward mapping for exploitation and a backward mapping for constraint, and it relies on the neural tangent kernel (NTK) of an infinitely wide network to build a proxy model. Though effective, the NTK cannot learn features because of its parametrization, and its use prevents the incorporation of powerful pre-trained Language Models (LMs) that can capture the rich biophysical information in millions of biological sequences. We adopt an alternative proxy model, adding a linear head to a pre-trained LM, and propose a linearization scheme. This yields a closed-form loss and also takes into account the biophysical information in the pre-trained LM. In addition, the forward mapping and the backward mapping play different roles and thus deserve different weights during sequence optimization. To achieve this, we train an auxiliary model and leverage its weak supervision signal via a bi-level optimization framework to effectively learn how to balance the two mappings. Further, by extending the framework, we develop the first learning rate adaptation module Adaptive-η, which is compatible with all gradient-based algorithms for offline model-based optimization. Experimental results on DNA/protein sequence design tasks verify the effectiveness of our algorithm. Our code is available here.

1. INTRODUCTION

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. This offline setting is realistic, since in many real-world scenarios we do not have interactive access to the ground-truth evaluation. Design tasks of interest include materials, aircraft, and biological sequences (Trabucco et al., 2021). In this paper, we focus on biological sequence design, covering both DNA and protein sequences, with the goal of maximizing some specified property of these sequences. A wide variety of methods have been proposed for biological sequence design, including evolutionary algorithms (Sinai et al., 2020; Ren et al., 2022), reinforcement learning (Angermueller et al., 2019), Bayesian optimization (Terayama et al., 2021), search/sampling with generative models (Brookes et al., 2019; Chan et al., 2021), and GFlowNets (Jain et al., 2022). Recently, gradient-based techniques have emerged as an effective alternative (Trabucco et al., 2021). These approaches first train a deep neural network (DNN) on the static dataset as a proxy and then obtain new designs by performing gradient ascent steps directly on existing designs. Such methods have been widely used in biological sequence design (Norn et al., 2021; Tischer et al., 2020; Linder & Seelig, 2020). One obstacle is the out-of-distribution issue: the trained proxy model is inaccurate on newly generated sequences. To mitigate this issue, recent work regularizes either the model (Trabucco et al., 2021; Yu et al., 2021; Fu & Levine, 2021) or the design itself (Chen et al., 2022). The first category focuses on training a better proxy by introducing inductive biases such as robustness (Yu et al., 2021). The second category introduces bidirectional learning (Chen et al., 2022), which consists of a forward mapping and a backward mapping, to optimize the design directly.
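The gradient-based recipe above can be sketched in a few lines. The following is a deliberately simplified illustration, not the paper's method: the "designs" are continuous vectors, the proxy is a closed-form ridge regressor rather than a DNN, and all names (`proxy`, `w_proxy`, `eta`) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static dataset: designs X (continuous stand-ins for sequences) and scores y.
X = rng.normal(size=(128, 8))
w_true = rng.normal(size=8)
y = X @ w_true

# 1) Train a proxy on the offline data (here: ridge regression in closed form).
lam = 1e-3
w_proxy = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)

def proxy(x):
    # Proxy prediction of the design score.
    return x @ w_proxy

# 2) Gradient ascent on an existing design: x <- x + eta * grad_x proxy(x).
x = X[0].copy()
eta = 0.1
for _ in range(50):
    x = x + eta * w_proxy  # for a linear proxy, grad_x proxy(x) is just w_proxy
```

After the ascent steps, `proxy(x)` exceeds `proxy(X[0])`; the out-of-distribution issue discussed above is precisely that this improvement under the proxy need not transfer to the true objective.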
Specifically, the backward mapping leverages the high-scoring design to predict the static dataset, and vice versa for the forward mapping, which distills the information of the static dataset into the high-scoring design. This approach achieves state-of-the-art performance on a variety of tasks. Though effective, the proposed bidirectional learning relies on the neural tangent kernel (NTK) of an infinite-width model to yield a closed-form loss, which is a key component of its successful operation. The NTK cannot learn features due to its parameterization (Yang & Hu, 2021), and thus bidirectional learning cannot incorporate the wealth of biophysical information from Language Models (LMs) pre-trained over a vast corpus of unlabelled sequences (Elnaggar et al., 2021; Ji et al., 2021). To solve this issue, we construct a proxy model by combining a finite-width pre-trained LM with an additional layer. We then linearize the resulting proxy model, inspired by recent progress in deep linearization (Achille et al., 2021; Dukler et al., 2022). This scheme not only yields a closed-form loss but also exploits the rich biophysical information distilled into the pre-trained LM. In addition, the forward mapping encourages exploitation in the sequence space, while the backward mapping serves as a constraint that mitigates the out-of-distribution issue. It is important to maintain an appropriate balance between exploitation and constraint, and this balance can vary across design tasks as well as during the optimization process. We introduce a hyperparameter γ to control the balance and develop a bi-level optimization framework, Adaptive-γ, in which we train an auxiliary model and leverage its weak supervision signal to effectively update γ. To sum up, we propose BIdirectional learning for model-based Biological sequence design (BIB).
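To make the two mappings concrete, here is a minimal sketch under strong simplifying assumptions: `lm_features` is a fixed nonlinear map standing in for the frozen pre-trained LM, so training only the linear head reduces to ridge regression with a closed-form solution, and both directional losses can be computed exactly. All names and the γ-weighted combination are illustrative, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(1)
W_feat = rng.normal(size=(8, 16))

def lm_features(x):
    # Stand-in for a frozen pre-trained LM encoder: any fixed nonlinear feature map.
    return np.tanh(np.atleast_2d(x) @ W_feat)

def ridge_head(Phi, y, lam=1e-2):
    # Closed-form ridge solution for the linear head on top of frozen features.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Static dataset, and a candidate high-scoring design x_h with target score y_h.
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)
x_h, y_h = rng.normal(size=8), np.array([3.0])

# Forward mapping: a head fit on the static data should score x_h highly.
head_fwd = ridge_head(lm_features(X), y)
loss_fwd = float((lm_features(x_h) @ head_fwd - y_h) ** 2)

# Backward mapping: a head fit on (x_h, y_h) should still predict the static data.
head_bwd = ridge_head(lm_features(x_h), y_h)
loss_bwd = float(np.mean((lm_features(X) @ head_bwd - y) ** 2))

gamma = 1.0  # trade-off between exploitation (forward) and constraint (backward)
loss = gamma * loss_fwd + loss_bwd
```

In this sketch, minimizing `loss` with respect to `x_h` (with the heads re-solved in closed form at each step) is what pulls the candidate toward high predicted scores while keeping it consistent with the static dataset.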
Last but not least, since the offline nature of the problem prohibits standard cross-validation strategies for hyperparameter tuning, all gradient-based offline model-based algorithms preset the learning rate η, with the attendant danger of a poor selection. To address this, we extend Adaptive-γ to Adaptive-η, which effectively adapts the learning rate η via the weak supervision signal from the trained auxiliary model. To the best of our knowledge, Adaptive-η is the first learning rate adaptation module for gradient-based algorithms in offline model-based optimization. Experiments on DNA and protein sequence design tasks verify the effectiveness of BIB and Adaptive-η. To summarize, our contributions are three-fold:

• Instead of adopting the NTK, we propose to construct a proxy model by combining a pre-trained biological LM with an additional trainable layer. We then linearize the proxy model, leveraging recent progress on deep linearization. This yields a closed-form loss computation in bidirectional learning and allows us to exploit the rich biophysical information distilled into the LM via pre-training over millions of biological sequences.

• We propose a bi-level optimization framework, Adaptive-γ, in which we leverage weak signals from an auxiliary model to achieve a satisfactory trade-off between exploitation and constraint.

• We further extend this bi-level optimization framework to Adaptive-η. As the first learning rate tuning scheme in offline model-based optimization, Adaptive-η allows learning rate adaptation for any gradient-based algorithm.
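The idea of adapting η from an auxiliary model's weak signal can be illustrated with a crude accept/shrink heuristic. This is only a schematic stand-in for the bi-level update, with invented names (`w_proxy`, `w_aux`, `aux_score`) and linear models in place of real networks.

```python
import numpy as np

rng = np.random.default_rng(2)
w_proxy = rng.normal(size=8)                 # ascent direction from the proxy model
w_aux = w_proxy + 0.1 * rng.normal(size=8)   # independently trained auxiliary model

def aux_score(x):
    # Weak supervision signal: the auxiliary model's predicted score.
    return float(x @ w_aux)

x = rng.normal(size=8)
x0 = x.copy()
eta = 1.0

# Adaptive-eta (schematic): keep a proxy step only if it also improves the
# auxiliary model's score; otherwise shrink the learning rate.
for _ in range(20):
    x_new = x + eta * w_proxy                 # inner step: gradient ascent on the proxy
    if aux_score(x_new) > aux_score(x):       # outer signal from the auxiliary model
        x = x_new
    else:
        eta *= 0.5                            # the step overshot: reduce eta
```

Because the signal comes from a separately trained model rather than the ground-truth objective, it is "weak" supervision: it can rank candidate steps, but its absolute values need not be trusted.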

2. PRELIMINARIES

2.1 OFFLINE MODEL-BASED OPTIMIZATION

Offline model-based optimization aims to find a design X that maximizes some unknown objective f(X). This can be formally written as X* = arg max_X f(X), where we have access to a size-N dataset D = {(X_1, y_1), . . . , (X_N, y_N)}, with X_i representing a design and y_i denoting its score. In this paper, X_i represents a biological sequence design (a DNA or protein sequence), and y_i represents a property of the sequence, such as the fluorescence level of the green fluorescent protein (Sarkisyan et al., 2016).
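As a toy illustration of this setup (the quadratic objective and all names are invented for the example), the dataset D is collected once from f, after which f may not be queried again; the strongest design already in D is then the trivial baseline that offline methods try to beat.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(X):
    # Ground-truth objective: a black box at design time, used here only to build D.
    return -np.sum((X - 0.5) ** 2, axis=-1)

# Offline dataset D = {(X_i, y_i)}: designs and their pre-recorded scores.
X_data = rng.uniform(-1.0, 1.0, size=(100, 4))
y_data = f(X_data)

# With no further queries to f allowed, the best design observed in D is a
# natural baseline; offline MBO aims to propose designs that surpass it.
best_in_data = X_data[np.argmax(y_data)]
```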

2.2. BIOLOGICAL SEQUENCE REPRESENTATION

Following (Norn et al., 2021; Killoran et al., 2017; Linder & Seelig, 2021), we adopt the position-specific scoring matrix to represent a length-L protein sequence as X ∈ R^{L×20}, where the 20 columns correspond to the 20 kinds of amino acids. For a real-world protein sequence, X[l, :] (0 ≤ l ≤ L − 1) is a one-hot vector.


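This representation can be sketched as follows; the encoding of a real sequence as a degenerate (one-hot) position-specific scoring matrix is standard, though the function name and alphabet ordering here are our own choices.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot_pssm(seq):
    # Encode a protein sequence as an L x 20 matrix: row l is the one-hot
    # (degenerate PSSM) row for the amino acid at position l.
    X = np.zeros((len(seq), 20))
    for l, aa in enumerate(seq):
        X[l, AMINO_ACIDS.index(aa)] = 1.0
    return X

X = one_hot_pssm("MKV")
# X has shape (3, 20), and each row sums to 1.
```

Relaxing the one-hot rows to arbitrary real values is what allows the gradient-based updates described earlier to operate on this representation.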