BIDIRECTIONAL LEARNING FOR OFFLINE MODEL-BASED BIOLOGICAL SEQUENCE DESIGN

Anonymous authors

Abstract

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. In this paper, we focus on biological sequence design to maximize some sequence score. A recent approach employs bidirectional learning, combining a forward mapping for exploitation and a backward mapping for constraint, and it relies on the neural tangent kernel (NTK) of an infinitely wide network to build a proxy model. Though effective, the NTK cannot learn features because of its parametrization, and its use prevents the incorporation of powerful pre-trained Language Models (LMs) that can capture the rich biophysical information in millions of biological sequences. We adopt an alternative proxy model, adding a linear head to a pre-trained LM, and propose a linearization scheme. This yields a closed-form loss and also takes into account the biophysical information in the pre-trained LM. In addition, the forward mapping and the backward mapping play different roles and thus deserve different weights during sequence optimization. To achieve this, we train an auxiliary model and leverage its weak supervision signal via a bi-level optimization framework to effectively learn how to balance the two mappings. Further, by extending the framework, we develop the first learning rate adaptation module Adaptive-η, which is compatible with all gradient-based algorithms for offline model-based optimization. Experimental results on DNA/protein sequence design tasks verify the effectiveness of our algorithm. Our code is available here.

1. INTRODUCTION

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. This offline setting is realistic since in many real-world scenarios we do not have interactive access to the ground-truth evaluation. The design tasks of interest include material, aircraft, and biological sequence design (Trabucco et al., 2021). In this paper, we focus on biological sequence design, including DNA and protein sequences, with the goal of maximizing some specified property of these sequences. A wide variety of methods have been proposed for biological sequence design, including evolutionary algorithms (Sinai et al., 2020; Ren et al., 2022), reinforcement learning methods (Angermueller et al., 2019), Bayesian optimization (Terayama et al., 2021), search/sampling using generative models (Brookes et al., 2019; Chan et al., 2021), and GFlowNets (Jain et al., 2022). Recently, gradient-based techniques have emerged as an effective alternative (Trabucco et al., 2021). These approaches first train a deep neural network (DNN) on the static dataset as a proxy and then obtain new designs by directly performing gradient ascent steps on existing designs. Such methods have been widely used in biological sequence design (Norn et al., 2021; Tischer et al., 2020; Linder & Seelig, 2020).

One obstacle is the out-of-distribution issue, where the trained proxy model is inaccurate for newly generated sequences. To mitigate this issue, recent work proposes regularization of the model (Trabucco et al., 2021; Yu et al., 2021; Fu & Levine, 2021) or of the design itself (Chen et al., 2022). The first category focuses on training a better proxy by introducing inductive biases such as robustness (Yu et al., 2021). The second category introduces bidirectional learning (Chen et al., 2022), which consists of a forward mapping and a backward mapping, to optimize the design directly. Specifically, the backward mapping leverages the high-scoring design to predict the static dataset and vice versa for the forward mapping, which distills the information of the static dataset into the high-scoring design. This approach achieves state-of-the-art performance on a variety of tasks.

Though effective, the proposed bidirectional learning relies on the neural tangent kernel (NTK) of an infinite-width model to yield a closed-form loss, which is a key component of its successful operation. The NTK cannot learn features due to its parameterization (Yang & Hu, 2021), and thus bidirectional learning cannot incorporate the wealth of biophysical information from Language Models (LMs) pre-trained over a vast corpus of unlabelled sequences (Elnaggar et al., 2021; Ji et al., 2021). To solve this issue, we construct a proxy model by combining a finite-width pre-trained LM with an additional layer. We then linearize the resultant proxy model, inspired by the recent progress in deep linearization (Achille et al., 2021; Dukler et al., 2022). This scheme not only yields a closed-form loss but also exploits the rich biophysical information that has been distilled into the pre-trained LM.

In addition, the forward mapping encourages exploitation in the sequence space and the backward mapping serves as a constraint to mitigate the out-of-distribution issue. It is important to maintain an appropriate balance between exploitation and constraint, and the right balance can vary across design tasks as well as during the optimization process.
We introduce a hyperparameter γ to control the balance, and develop a bi-level optimization framework Adaptive-γ. In this framework, we train an auxiliary model and leverage its weak supervision signal to effectively update γ. To sum up, we propose BIdirectional learning for model-based Biological sequence design (BIB). Last but not least, since the offline nature prohibits standard cross-validation strategies for hyperparameter tuning, all gradient-based offline model-based algorithms preset the learning rate η. There is a danger of a poor selection, and to address this, we propose to extend Adaptive-γ to Adaptive-η, which effectively adapts the learning rate η via the weak supervision signal from the trained auxiliary model. To the best of our knowledge, Adaptive-η is the first learning rate adaptation module for gradient-based algorithms on offline model-based optimization. Experiments on DNA and protein sequence design tasks verify the effectiveness of BIB and Adaptive-η.

To summarize, our contributions are three-fold:

• Instead of adopting the NTK, we propose to construct a proxy model by combining a pre-trained biological LM with an additional trainable layer. We then linearize the proxy model, leveraging the recent progress on deep linearization. This yields a closed-form loss computation in bidirectional learning and allows us to exploit the rich biophysical information distilled into the LM via pre-training over millions of biological sequences.

• We propose a bi-level optimization framework Adaptive-γ where we leverage weak signals from an auxiliary model to achieve a satisfactory trade-off between exploitation and constraint.

• We further extend this bi-level optimization framework to Adaptive-η. As the first learning rate tuning scheme in offline model-based optimization, Adaptive-η allows learning rate adaptation for any gradient-based algorithm.

2. PRELIMINARIES

2.1. OFFLINE MODEL-BASED OPTIMIZATION

Offline model-based optimization aims to find a design X that maximizes some unknown objective f(X). This can be formally written as

X* = arg max_X f(X),    (1)

where we have access to a size-N dataset D = {(X_1, y_1), ..., (X_N, y_N)}, with X_i representing a design and y_i denoting its score. In this paper, X_i represents a biological sequence design (a DNA or protein sequence), and y_i represents a property of the sequence, such as the fluorescence level of the green fluorescent protein (Sarkisyan et al., 2016).

2.2. BIOLOGICAL SEQUENCE REPRESENTATION

Following (Norn et al., 2021; Killoran et al., 2017; Linder & Seelig, 2021), we adopt the position-specific scoring matrix to represent a length-L protein sequence as X ∈ R^{L×20}, where 20 is the number of amino acid types. For a real-world protein sequence, X[l, :] (0 ≤ l ≤ L-1) is a one-hot vector denoting one kind of amino acid. During optimization, X[l, :] is a continuous vector and softmax(X[l, :]) represents the probability distribution over the 20 amino acids at position l. Similarly, for a DNA sequence we have X ∈ R^{L×4}, where 4 is the number of DNA bases. The protein sequence X is fed into the embedding layer of the LM, which produces the embedding

e = EMB(softmax(X)).    (2)

The main block of the LM takes e as input and outputs biophysical features. The DNA LM, which adopts the k-mer representation, differs slightly from protein LMs; see Appendix A.1 for details.
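To make Eq. (2) concrete, the minimal PyTorch sketch below forms the probability-weighted mixture of token embeddings that stands in for EMB(softmax(X)). The embedding matrix here is a random placeholder for the pre-trained LM's embedding layer, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

L, VOCAB, D_EMB = 8, 20, 1024        # length-8 protein, 20 amino acids, assumed embedding size

# Continuous sequence design: one row of logits per position (Sec. 2.2).
X = (0.01 * torch.randn(L, VOCAB)).requires_grad_()

# Placeholder for the LM's token-embedding matrix; a real run would take this
# from the pre-trained protein LM (e.g. ProtBERT) instead.
emb_matrix = torch.randn(VOCAB, D_EMB)

# e = EMB(softmax(X)): a probability-weighted mixture of token embeddings that
# keeps the pipeline differentiable with respect to the design X.
probs = F.softmax(X, dim=-1)         # (L, 20), one distribution per position
e = probs @ emb_matrix               # (L, D_EMB), fed to the LM's main block
```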

2.3. GRADIENT ASCENT ON SEQUENCE

A common approach to the posed offline model-based optimization problem is to train a proxy f_θ(X) on the offline dataset,

θ* = arg min_θ (1/N) Σ_{i=1}^{N} (f_θ(X_i) - y_i)².    (3)

Then we can obtain the high-scoring design X_h by T gradient ascent steps:

X_{t+1} = X_t + η ∇_X f_{θ*}(X)|_{X=X_t}, for t ∈ [0, T-1],

where the high-scoring design X_h is taken as X_T. Considering the discrete nature of biological sequences, the input of f_θ(·) should be discrete one-hot vectors. Following (Norn et al., 2021), we can perform the conversion and predict the score via

X̃_i = softmax(X_i),  Z_i = onehot(argmax(X̃_i)),    (6)
ŷ = f_θ(Z_i).

The gradient with respect to X_i can then be approximated as

df_θ(Z_i)/dx_i ≈ df_θ(Z_i)/dz_i · dx̃_i/dx_i,

where we unroll the matrices X_i, X̃_i and Z_i as vectors x_i, x̃_i and z_i for notational convenience. This approximation allows us to backpropagate directly from the proxy to the sequence design X_i. For brevity, we will still write f_θ(X_i) for the proxy.
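The sketch below runs this gradient-ascent loop with the softmax/one-hot conversion implemented as a straight-through estimator, which matches the gradient approximation above. The proxy is a small toy network rather than the LM-based proxy of Sec. 3, and the sizes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L, VOCAB, T, ETA = 8, 20, 25, 0.1

# Toy stand-in for the trained proxy f_theta; the paper instead uses an LM
# feature extractor plus a linear head fitted to the static dataset.
proxy = nn.Sequential(nn.Flatten(), nn.Linear(L * VOCAB, 64), nn.ReLU(), nn.Linear(64, 1))

X = (0.01 * torch.randn(1, L, VOCAB)).requires_grad_()     # continuous design

for t in range(T):
    probs = F.softmax(X, dim=-1)
    one_hot = F.one_hot(probs.argmax(dim=-1), VOCAB).float()
    # Straight-through approximation: the forward pass sees the discrete
    # one-hot sequence, the backward pass flows through softmax(X).
    z = one_hot + probs - probs.detach()
    score = proxy(z).sum()
    grad, = torch.autograd.grad(score, X)
    with torch.no_grad():
        X += ETA * grad                                    # gradient ascent step
```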

2.4. BIDIRECTIONAL LEARNING

As shown in Figure 1, bidirectional learning (Chen et al., 2022) consists of two mappings. The forward mapping trains the proxy on the static dataset (X_l, y_l) and uses it to predict the score y_h of the high-scoring design X_h, yielding the loss L_l2h(X_h). Conversely, the backward mapping trains the proxy on the high-scoring design (X_h, y_h), by solving

θ*(X_h) = arg min_θ ‖y_h - f_θ(X_h)‖² + β‖θ‖²,    (12)

and uses it to predict the scores y_l of the static dataset, yielding the loss L_h2l(X_h). The high-scoring design X_h can be optimized by minimizing the bidirectional learning loss L(X_h) = L_l2h(X_h) + L_h2l(X_h).

3. METHOD

In this section, we first illustrate how to leverage deep linearization to compute the bidirectional learning loss in a closed form. Subsequently, we introduce a hyperparameter γ to control the balance between the forward mapping and the backward mapping. We then develop a novel bi-level optimization framework Adaptive-γ, which leverages a weak supervision signal from an auxiliary model to effectively update γ. Last but not least, we extend this framework to Adaptive-η, which enables us to adapt the learning rate η for all gradient-based offline model-based algorithms. We summarize our method in Algorithm 1.

3.1. DEEP LINEARIZATION FOR BIDIRECTIONAL LEARNING

In bidirectional learning, the backward mapping loss is intractable for a finite neural network, so Chen et al. (2022) employ a neural network with infinite width, which yields a closed-form loss via the NTK. This however makes it impossible to incorporate the rich biophysical information that has been distilled into a pre-trained LM (Yang & Hu, 2021). Considering this, we construct a proxy model by combining a finite-width pre-trained LM with an additional layer. We then linearize the resultant proxy model, inspired by the recent progress in deep linearization, which has established that an overparameterized DNN model is close to its linearization (Achille et al., 2021; Dukler et al., 2022).

Denote by θ_0 = (θ_pt, θ_lin_init) ∈ R^{D×1} the proxy model parameters, obtained by combining the parameters θ_pt of the pre-trained LM with a random initialization θ_lin_init of the linear layer. In this paper, we adopt the pre-trained DNABERT (Ji et al., 2021) and Prot-BERT (Elnaggar et al., 2021) models, and compute the average of token embeddings as the extracted feature, which is fed into the linear layer to build the proxy. We then construct a linear approximation of the proxy model:

f_θ(X) ≈ f_{θ_0}(X) + ∇_θ f_{θ_0}(X) (θ - θ_0),    (13)

where f_θ(X), f_{θ_0}(X) ∈ R, ∇_θ f_{θ_0}(X) ∈ R^{1×D} and (θ - θ_0) ∈ R^{D×1}. Intuitively, if fine-tuning does not significantly change θ_0, then this linearization is a good approximation. By leveraging this linearization, we can obtain a closed-form solution for Eq. (12):

θ*(X_h) = (∇_θ f_{θ_0}(X_h)^⊤ ∇_θ f_{θ_0}(X_h) + βI)^{-1} ∇_θ f_{θ_0}(X_h)^⊤ (y_h - f_{θ_0}(X_h)) + θ_0.    (14)

Building on this result, we can compute the bidirectional learning loss as

L_bi(X_h) = 1/2 (‖y_h - K_{X_h X_l}(K_{X_l X_l} + βI)^{-1}(y_l - f_{θ_0}(X_l))‖² + ‖y_l - K_{X_l X_h}(K_{X_h X_h} + βI)^{-1}(y_h - f_{θ_0}(X_h))‖²),

where K(X_i, X_j) = ∇_θ f_{θ_0}(X_i) ∇_θ f_{θ_0}(X_j)^⊤. Following (Dukler et al., 2022), we can also linearize only the last layer of the network for simplicity, which defines the kernel K(X_i, X_j) = BERT(X_i)^⊤ BERT(X_j), where BERT(X) denotes the feature of the sequence X extracted by BERT. Its kernel nature makes this approach suitable for small-data tasks (Arora et al., 2020), especially in drug discovery, where the labeling cost of DNA/protein sequences is high.
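A minimal sketch of the last-layer linearization variant of the bidirectional loss: the kernel is an inner product of frozen features, so each mapping reduces to kernel ridge regression. The feature extractor below is a random projection standing in for BERT(X), the f_{θ_0} offset terms are dropped for brevity, and all shapes are assumptions for illustration.

```python
import torch

L, VOCAB, D_FEAT = 8, 20, 64
PROJ = torch.randn(L * VOCAB, D_FEAT)                      # frozen stand-in for the LM feature map

def features(X):
    # Placeholder for BERT(X), e.g. the mean token embedding of the frozen LM.
    return X.flatten(1) @ PROJ

def kernel(Xa, Xb):
    # K(X_i, X_j) = BERT(X_i)^T BERT(X_j) under last-layer linearization.
    return features(Xa) @ features(Xb).T

def bidir_loss(X_h, y_h, X_l, y_l, beta=1e-6):
    K_ll, K_hh = kernel(X_l, X_l), kernel(X_h, X_h)
    K_hl, K_lh = kernel(X_h, X_l), kernel(X_l, X_h)
    # Forward mapping: kernel ridge regression on the static set predicts y_h.
    l2h = ((y_h - K_hl @ torch.linalg.solve(K_ll + beta * torch.eye(K_ll.shape[0]), y_l)) ** 2).sum()
    # Backward mapping: kernel ridge regression on the high-scoring design predicts y_l.
    h2l = ((y_l - K_lh @ torch.linalg.solve(K_hh + beta * torch.eye(K_hh.shape[0]), y_h)) ** 2).sum()
    return 0.5 * (l2h + h2l)

X_l = torch.randn(50, L, VOCAB); y_l = torch.randn(50, 1)  # toy static dataset
y_h = torch.full((1, 1), 10.0)                             # predefined target score
X_h = X_l[:1].clone().requires_grad_()                     # design being optimized
bidir_loss(X_h, y_h, X_l, y_l).backward()                  # gradient w.r.t. X_h
```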
3.2. ADAPTIVE-γ

The forward mapping and the backward mapping play different roles in the sequence optimization process: the forward mapping encourages the high-scoring sequence to search for a higher target score (exploitation), while the backward mapping serves as a constraint. Since different sequences require different degrees of constraint, we introduce an extra hyperparameter γ ∈ [0, 1] to control the balance between the corresponding terms in the loss function:

L_bi(X_h, γ) = γ L_l2h(X_h) + (1 - γ) L_h2l(X_h).    (17)

Thus γ = 1.0 corresponds to the forward mapping alone, γ = 0 to the backward mapping alone, and γ = 0.5 recovers the bidirectional loss of (Chen et al., 2022). It is non-trivial to determine the most suitable value for γ since we do not know the ground-truth score of a new design. One possible solution is to train an auxiliary model f_aux(·) to serve as a proxy evaluation; a reasonable auxiliary is a simple regression model fitted to the offline dataset.

Although this auxiliary model cannot yield ground-truth scores, it can provide weak supervision signals to update γ, since the auxiliary model and the bidirectional learning provide complementary information. This is similar to co-teaching (Han et al., 2018), where two models leverage each other's view. Formally, we introduce the Adaptive-γ framework. Given a good choice of γ, the produced X_h is expected to have a high score f_aux(X_h), based on which we can choose γ. To make the search for γ more efficient, we formulate this process as a bi-level optimization problem:

γ* = arg max_γ f_aux(X*_h(γ)),  s.t.  X*_h(γ) = arg min_{X_h} L_bi(X_h, γ).

We can then use the hyper-gradient ∂f_aux(X*_h(γ))/∂γ to update γ. Specifically, the inner-level solution can be approximated via a gradient descent step with learning rate η:

X*_h(γ) = X_h - η dL_bi(X_h, γ)/dX_h^⊤.

For the outer level, we update γ by hyper-gradient ascent:

γ ← γ + η′ df_aux(X*_h(γ))/dγ = γ + η′ (df_aux(X*_h)/dx*_h) · (dx*_h(γ)/dγ),    (21)

where we unroll the matrix X_h as a vector x_h for better illustration.
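The following sketch implements one Adaptive-γ update as a one-step unrolled hyper-gradient. Here bi_loss is any γ-weighted bidirectional loss (for example, the kernel sketch above with an explicit γ argument) and f_aux is any differentiable auxiliary scorer; the names, step sizes, and the clamp to [0, 1] are assumptions rather than the paper's exact implementation.

```python
import torch

def adaptive_gamma_step(X_h, gamma, f_aux, bi_loss, eta=0.1, eta_prime=0.01):
    """One bi-level update: inner gradient step on X_h, outer hyper-gradient on gamma."""
    gamma = gamma.clone().requires_grad_()
    # Inner level: one gradient-descent step on the design, kept in the graph
    # (create_graph=True) so that X_star stays differentiable w.r.t. gamma.
    loss = bi_loss(X_h, gamma)
    grad_X, = torch.autograd.grad(loss, X_h, create_graph=True)
    X_star = X_h - eta * grad_X
    # Outer level: move gamma in the direction that raises the auxiliary score.
    hyper_grad, = torch.autograd.grad(f_aux(X_star), gamma)
    with torch.no_grad():
        gamma = (gamma + eta_prime * hyper_grad).clamp(0.0, 1.0)   # keep gamma in [0, 1]
    return gamma.detach()
```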

3.3. ADAPTIVE-η

We now extend the Adaptive-γ framework to Adaptive-η. As the first learning rate adaptation module for offline model-based optimization, Adaptive-η is compatible with all gradient-based algorithms and can effectively fine-tune the learning rate η via the auxiliary model's weak supervision signal. All gradient-based methods that maximize L_θ(X) with respect to X have the following general form:

X_{t+1} = X_t + η OPT(∇_X L_θ(X)|_{X=X_t}), for t ∈ [0, T-1],

where η is the learning rate of the optimizer. For methods such as simple gradient ascent (Grad), COMs (Trabucco et al., 2021), ROMA (Yu et al., 2021) and NEMO (Fu & Levine, 2021), L_θ(·) is related to the proxy model f_θ(·); for BDI (Chen et al., 2022) and our proposed method BIB, L_θ(·) is the negative of the bidirectional learning loss, i.e., L_θ = -L_bi. Though the learning rate η can be adapted by optimizers such as Adam (Kingma & Ba, 2015), these adaptations rely only on the past optimization history and do not consider the weak supervision signal from the auxiliary model. Adaptive-η instead optimizes η by solving

η* = arg max_η f_aux(X*_h(η)),

where η can be updated via gradient ascent. Since the sequence optimization procedure is highly sensitive to the learning rate η, we reset η to η_0 at each iteration and update it from η_0:

η = η_0 + η′ df_aux(X*_h(η))/dη.    (23)

In general, this serves to stabilize the optimization procedure.
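In the same one-step-unrolled style, the sketch below adapts η from its reset value η_0 using the auxiliary model. OPT is simplified to the raw gradient (the paper plugs in the Adam update direction), and all names and step sizes are illustrative assumptions.

```python
import torch

def adaptive_eta_step(X_t, f_aux, objective, eta0=0.1, eta_prime=0.01):
    """Reset eta to eta0, take one hyper-gradient step on it, then update the design."""
    eta = torch.tensor(eta0, requires_grad=True)
    # Generic update X_{t+1} = X_t + eta * OPT(grad); OPT is the identity here.
    grad_X, = torch.autograd.grad(objective(X_t), X_t, create_graph=True)
    aux_score = f_aux(X_t + eta * grad_X)
    # Weak supervision from the auxiliary model tells us whether eta0 was too small or too large.
    hyper_grad, = torch.autograd.grad(aux_score, eta)
    eta_new = float(eta0 + eta_prime * hyper_grad)
    with torch.no_grad():
        X_next = X_t + eta_new * grad_X      # redo the design update with the adapted learning rate
    return X_next.detach().requires_grad_(), eta_new
```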

4. EXPERIMENTS

We conduct extensive experiments on DNA and protein design tasks, and aim to answer three research questions: (1) How does BIB compare with state-of-the-art algorithms? (2) Is every design component necessary in BIB? (3) Does the Adaptive-η module improve gradient-based methods?

4.1. BENCHMARK

We conduct experiments on two DNA tasks, TFBind8(r) and TFBind10(r), following (Chen et al., 2022), and on the three protein tasks from (Ren et al., 2022) with the most data points: avGFP, AAV, and E4B. See Appendix A.2 for details. Following (Trabucco et al., 2021), we select the top N = 128 most promising sequences for each comparison method. Among these sequences, we report the maximum normalized ground-truth score as the evaluation metric, following (Ren et al., 2022).
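A small sketch of the evaluation protocol as described above: each method's top N = 128 candidates (by its own ranking) are scored by the oracle, and the maximum normalized score is reported. The min-max normalization bounds and function names are assumptions for illustration.

```python
import numpy as np

def evaluate(candidates, method_scores, oracle, y_min, y_max, n=128):
    """Score the method's top-n candidates with the oracle and report the best one."""
    top_idx = np.argsort(method_scores)[-n:]                 # method's own ranking
    true_scores = np.array([oracle(candidates[i]) for i in top_idx])
    normalized = (true_scores - y_min) / (y_max - y_min)     # assumed min-max normalization
    return normalized.max()
```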

4.2. COMPARISON METHODS

We compare BIB with two groups of baselines: gradient-based methods and non-gradient-based methods. For a fair comparison, the pre-trained LM is used for all methods involving a proxy, and we do not fine-tune the LM. The gradient-based methods include: 1) Grad: gradient ascent on existing sequences to obtain new sequences; 2) COMs (Trabucco et al., 2021): lower-bounds the DNN model by the ground-truth values and then applies gradient ascent; 3) ROMA (Yu et al., 2021): incorporates a smoothness prior into the DNN model before the gradient ascent steps; 4) NEMO (Fu & Levine, 2021): leverages the normalized maximum-likelihood estimator to bound the distance between the DNN model and the ground-truth values; 5) BDI (Chen et al., 2022): adopts the infinitely wide neural network and its NTK to yield a closed-form bidirectional learning loss. The non-gradient-based methods include: 1) BO-qEI (Wilson et al., 2017): builds an acquisition function for sequence exploration; 2) CMA-ES (Hansen, 2006): estimates the covariance matrix to adjust the sequence distribution towards the high-scoring region; 3) AdaLead (Sinai et al., 2020): performs a hill-climbing search on the proxy and then queries the sequences with high predictions; 4) CbAS (Brookes et al., 2019): builds a generative model for sequences above a property threshold and gradually adapts the distribution by increasing the threshold; 5) PEX (Ren et al., 2022): prioritizes the evolutionary search for protein sequences with low mutation counts; 6) GENH (Chan et al., 2021): enhances the score through a learned latent space.

4.3. TRAINING DETAILS

We follow the training settings in (Chen et al., 2022) unless otherwise specified. We choose OPT as the Adam optimizer (Kingma & Ba, 2015) for all gradient-based methods. We implement the auxiliary model as a linear layer on top of the features from the pre-trained LM. We set the number of iterations T to 25 for all experiments following (Norn et al., 2021) and η_0 to 0.1 following (Chen et al., 2022). We run every setting over 16 trials and report the average score. See Appendix A.3 for other details.
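A sketch of how such an auxiliary model could look: a single linear head on top of frozen pre-trained LM features, fitted to the offline dataset. The feature extractor, dimensions, and class name are placeholders, not the paper's exact module; the head would then be fitted with a standard regression loss (e.g. MSE with Adam) on the static dataset.

```python
import torch
import torch.nn as nn

class AuxiliaryModel(nn.Module):
    """Linear head on frozen pre-trained LM features, fitted to the offline dataset."""
    def __init__(self, lm_features, feat_dim):
        super().__init__()
        self.lm_features = lm_features            # pre-trained feature extractor
        for p in self.lm_features.parameters():
            p.requires_grad_(False)               # freeze the LM, train only the head
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, X):
        # Parameters are frozen, but gradients still flow to the input X,
        # which the Adaptive-gamma / Adaptive-eta hyper-gradients rely on.
        return self.head(self.lm_features(X))
```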

4.4. RESULTS AND ANALYSIS

We report all experimental results in Table 1 and plot the ranking statistics in Figure 2. We make the following observations. (1) As shown in Table 1, BIB consistently outperforms the Grad method on all tasks, which demonstrates that BIB can effectively mitigate the out-of-distribution issue. (2) Furthermore, BIB outperforms BDI on 4 out of 5 tasks, which demonstrates the effectiveness of the pre-trained biological LM over the NTK. The reason why BDI outperforms BIB on TFBind10(r) may be that short sequences do not rely much on the rich sequential information from the pre-trained LM. (3) As shown in Figure 2, the gradient-based methods generally perform better than the non-gradient-based methods, as also observed by Trabucco et al. (2021). (4) The gradient-based methods are inferior on the AAV task. One possible reason is that the design space of AAV (20^28) is much smaller than those of avGFP (20^239) and E4B (20^102), which makes generative modeling and evolutionary algorithms more suitable. (5) This conjecture is also supported by the results on the two DNA design tasks. The average rankings of gradient-based and non-gradient-based methods are 3.5 and 9.5 on TFBind10(r), and 5.8 and 6.8 on TFBind8(r), respectively; the advantage of gradient-based methods is therefore larger on TFBind10(r) (9.5 - 3.5 = 6.0) than on TFBind8(r) (6.8 - 5.8 = 1.0). (6) The generative modeling methods CbAS and GENH yield poor results on all tasks, probably because the high-dimensional data distribution is very hard to model. (7) Overall, BIB attains the best performance on 3 out of 5 tasks and achieves the best ranking results, as shown in Table 1 and Figure 2.

We also visualize the trend of performance (the maximum normalized ground truth score) and the trade-off γ as a function of T on TFBind8(r) in Figure 3(a) and on avGFP in Figure 3(b). The performance generally increases with the time step T and then stabilizes, which demonstrates the effectiveness and robustness of BIB. Furthermore, we find that the γ values of TFBind8(r) and avGFP generally increase at first. This means that BIB reduces the impact of the constraint to encourage a more aggressive search for a high target value during the initial phase. Then the γ of TFBind8(r) continues to increase while the γ of avGFP decreases. We conjecture that the difference is caused by the sequence length. Small mutations of a biological sequence are often enough to yield a good candidate (Ren et al., 2022). For the length-239 protein in avGFP, dramatic mutations 1) are not necessary and 2) can easily lead to out-of-distribution points. The weak supervision signal from the auxiliary model therefore encourages a tighter constraint towards the static dataset. By contrast, the DNA sequence is relatively short, and a more widespread search of the sequence space can yield better results. To investigate this conjecture, we further visualize the trend of E4B in Figure 3(c). E4B also has long sequences (102) and we observe a similar first-increase-then-decrease trend, although it is less pronounced.

4.5. ABLATION STUDIES

In this subsection, we conduct ablation studies to verify the effectiveness of the forward mapping, the backward mapping, and the Adaptive-γ module of BIB. We report the experimental results in Table 2.

Forward mapping & backward mapping.
We can observe that bidirectional learning (γ = 0.5) performs better than either the forward mapping (γ = 1.0) or the backward mapping (γ = 0.0) alone on most tasks, which demonstrates the effectiveness of both mappings. The advantage of the bidirectional mappings over the forward mapping is larger on the long-sequence tasks avGFP (238) and E4B (102) than on the short-sequence tasks. A possible explanation is that the constraint is more important for long-sequence design than for short-sequence design, since the search space is large and many mutations can easily go out of distribution.

Adaptive-γ. BIB learns γ, and this leads to improvements over the fixed bidirectional mappings (γ = 0.5) on all tasks, verifying the effectiveness of Adaptive-γ. We also consider the following variant,

X*_h = arg min_{X_h} L_bi(X_h, 0.5) - f_aux(X_h),

which jointly optimizes the bidirectional learning loss L_bi(X_h, 0.5) and the auxiliary term f_aux(X_h). We found that this yields similar or even worse results than pure bidirectional learning. The reason may be that the weak supervision signal from f_aux(X_h) can serve as a guide to update the scalar γ, but not as a component of the main optimization objective that directly updates the sequence.

Adaptive-η. In the final column of Table 2, we examine the performance of the Adaptive-η module. Adding this module leads to improvements on all five tasks, which demonstrates its effectiveness.

4.6. ADAPTIVE-η

In this subsection, we further demonstrate the effectiveness of the Adaptive-η module on all six gradient-based methods. We conduct experiments on two tasks: TFBind8(r) and avGFP. Since the use of the infinitely wide neural network leads to poor performance for BDI, we modify its implementation via deep linearization so that it can make use of the pre-trained LM. As shown in Table 3, Adaptive-η provides a consistent gain in all scenarios, which demonstrates the widespread applicability and effectiveness of the module. Furthermore, Adaptive-η leads to a maximum improvement of 1.4% on TFBind8(r) and 12.5% on avGFP. ROMA is the algorithm that benefits the most. One possible explanation is that ROMA incorporates a local smoothness prior that leads to more stable gradients, with which Adaptive-η can be more effective. Similar to Sec. 4.5, we also consider the variant

X*_h = arg max_{X_h} L_θ(X_h) + f_aux(X_h),

which performs joint optimization instead of bi-level optimization over the two objectives. As shown in Table 3, joint optimization generally deteriorates the performance. This again verifies that the auxiliary model can only serve as a guide instead of contributing to the main objective.

5. RELATED WORK

Biological sequence design. A wide range of algorithms has been proposed for biological sequence design. Evolutionary algorithms (Sinai et al., 2020; Ren et al., 2022) leverage a learned surrogate model to guide evolution towards the high-scoring region. Angermueller et al. (2019) propose a flexible reinforcement learning framework in which sequence design is a sequential decision-making problem. Bayesian optimization methods propose candidate solutions via an acquisition function (Terayama et al., 2021). Deep generative model methods design sequences in a latent space (Chan et al., 2021) or gradually adapt the distribution towards the high-scoring region (Brookes et al., 2019). GFlowNets (Jain et al., 2022) amortize the cost of search over learning and encourage diversity. Gradient-based methods leverage a surrogate model and its gradient information to maximize the desired property (Chen et al., 2022; Norn et al., 2021; Tischer et al., 2020; Linder & Seelig, 2020). Our proposed BIB belongs to the last category and leverages rich biophysical information (Ji et al., 2021; Elnaggar et al., 2021) to directly optimize the biological sequence.

Offline model-based optimization. The majority of sequence design algorithms (Angermueller et al., 2019; Sinai et al., 2020; Ren et al., 2022) focus on the online setting, where wet-lab experimental results in the current round are analyzed to propose candidates for the next round. The problem with this setting is that wet-lab experiments are often very expensive, so a purely data-driven, offline approach is attractive and has received substantial research attention recently (Trabucco et al., 2022; Kolli et al., 2022). Gradient-based methods have proven to be effective (Trabucco et al., 2021; Yu et al., 2021; Fu & Levine, 2021; Chen et al., 2022). Among these algorithms, Chen et al. (2022) propose bidirectional mappings to distill information from the static dataset into a high-scoring design, which achieves state-of-the-art performance on a variety of tasks. However, this bidirectional learning is designed for general tasks, such as robot and material design, and the rich biophysical information in millions of biological sequences is ignored. In this paper, we leverage recent advances in deep linearization to incorporate this rich biophysical information into bidirectional learning.

6. CONCLUSION

In this paper, we propose bidirectional learning for offline model-based biological sequence design. Our work builds on the recently proposed bidirectional learning approach (Chen et al., 2022), which is designed for general inputs and relies on the NTK of an infinitely wide network to yield a closed-form loss computation. Though effective, the NTK cannot learn features. We instead build a proxy model from a pre-trained LM with a linear head and apply a deep linearization scheme to the proxy, which yields a closed-form loss while incorporating the wealth of biophysical information in the LM. In addition, we propose Adaptive-γ to maintain a proper balance between the forward mapping and the backward mapping by leveraging the weak supervision signal from an auxiliary model. Based on this framework, we further propose Adaptive-η, the first learning rate adaptation strategy compatible with all gradient-based offline model-based algorithms. Experimental results on DNA and protein sequence design tasks verify the effectiveness of BIB and Adaptive-η.

7. ETHICS STATEMENT

Protein sequence design aims to find a protein sequence with a particular biological function and has a broad application scope. It can lead to improved drugs that are highly beneficial to society. For instance, designing an antibody protein for SARS-CoV-2 can potentially save millions of human lives (Kumar et al., 2021), and designing novel anti-microbial peptides (short protein sequences) is central to tackling the growing public health risks caused by anti-microbial resistance (Murray et al., 2022). Unfortunately, it is possible to direct the research results towards harmful purposes such as the design of biochemical weapons. As researchers, we believe that we must be aware of the potential harm of any research outcomes, and carefully consider whether the possible benefits outweigh the risks of harmful consequences. We also must recognize that we cannot control how the research may be used. In the case of this paper, we are confident that there is a much greater chance that the research outcomes will have a beneficial effect. We do not consider that there are any immediate ethical concerns with the research endeavour.

A.3 TRAINING DETAILS

We use PyTorch (Paszke et al., 2019) to run all experiments on one V100 GPU. Following the setting in Norn et al. (2021), we initialize a length-L protein sequence as a continuous matrix X_h ∈ R^{L×20} (X_h ∈ R^{L×4} for DNA), drawn from a normal distribution with mean 0 and standard deviation 0.01. To make this matrix correspond to the candidate sequence, we swap the largest value in each row X[l, :] with the value at the index of the candidate's amino acid.
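A small sketch of this initialization as we read it: the swap makes the argmax of each row equal the candidate's residue while keeping the matrix continuous. The function name and shapes are illustrative assumptions.

```python
import torch

def init_design(candidate_idx, vocab=20, std=0.01):
    """candidate_idx: LongTensor of length L with the residue index per position."""
    L = candidate_idx.shape[0]
    X = torch.randn(L, vocab) * std                     # N(0, 0.01) initialization
    row_max, max_idx = X.max(dim=-1)
    rows = torch.arange(L)
    # Swap the largest entry of each row into the candidate's residue column,
    # so argmax(X[l, :]) recovers the candidate sequence.
    current = X[rows, candidate_idx].clone()
    X[rows, candidate_idx] = row_max
    X[rows, max_idx] = current
    return X.requires_grad_()
```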

A.4 DIFFERENT PRETRAINED LMS

As shown in Table 5, we test the ProtBERT, ProtAlbert, and ProtBert-BFD models and find that higher-quality pre-trained models generally work better. The publicly available pre-trained DNA LMs are limited, so we only perform these experiments on the protein tasks. Elnaggar et al. (2021) demonstrate that the language models rank as ProtBert-BFD > ProtBert > ProtAlbert, and we observe that the performance ranking over the three protein tasks avGFP, AAV, and E4B follows the same order.

A.5 DIFFERENT DATASET SIZE

As shown in Table 6, we test the performance of BDI as a function of the dataset size (N = 20, 40, 60, 80, 100) on TFBind8(r) and TFBind10(r), since these tasks have exact oracle evaluations. We see that the performance is already good at N = 20 for TFBind8(r) and at N = 40 for TFBind10(r).

A.6 RANKING PERFORMANCE

As for prediction performance, the ranking should be: a fine-tuned NN > linearized pre-trained LM > NTK. We have conducted experiments to verify this. We sample half of the data, train each model to predict the other half, and report the mean squared loss in Table 7. A small mean squared loss indicates good prediction performance; thus, we have verified the above ranking order.



Figure 1: Illustration of bidirectional learning (Chen et al., 2022), where (X_l, y_l) denotes the static dataset, y_h is a large predefined target score, and X_h is the high-scoring design we aim to find.

Algorithm 1: BIdirectional learning for model-based Biological sequence design (BIB)
Input: static dataset D = (X_l, y_l), predefined target score y_h = 10, number of iterations T, pre-trained biological LM parameterized by θ_0, auxiliary model f_aux(·), regularization β.
Output: high-scoring design X*_h.
1: Initialize X_0 as the sequence with the highest score in D
2: for τ ← 0 to T - 1 do
3:   Leverage Adaptive-γ in Sec. 3.2 to update the balance γ by Eq. (21)
4:   if adapting the learning rate then
5:     Leverage Adaptive-η in Sec. 3.3 to update the learning rate η by Eq. (23)
6:   Optimize X by minimizing the bidirectional learning loss L_bi(X_τ, γ) in Eq. (17):
7:     X_{τ+1} = X_τ - η OPT(∇_X L_bi(X_τ, γ))
8: Return X*_h = X_T

Figure 2: Rank minima and maxima are represented by whiskers; vertical lines and black triangles denote medians and means.

Figure 3: Trend of performance and trade-off γ as a function of T .

Table 1: Experimental results (maximum normalized ground truth score) for comparison.

Table 2: Ablation studies on BIB components.

Table 3: Adaptive-η on all gradient-based methods.

Table 4: Dataset details (max of D, min of D_entire, max of D_entire).

Table 5: Experimental results on different pre-trained LMs for comparison (columns: avGFP, AAV, E4B).
ProtAlbert           — ± 0.456        0.478 ± 0.004    0.552 ± 0.023
ProtBert (adopted)   8.084 ± 0.224    0.501 ± 0.007    1.255 ± 0.029
ProtBert-BFD         8.240 ± 0.094    0.549 ± 0.009    1.880 ± 0.054

8. REPRODUCIBILITY STATEMENT

We provide the code implementation of BIB and Adaptive-η here and we also attach the code in the supplementary material. We describe the DNA/protein benchmarks in Sec. 4.1 and the training details in Sec. 4.3. We also explain how to obtain the sequence embedding from the pre-trained LM and how to perform gradient ascent steps on the sequence in Sec. 2.

A APPENDIX

A.1 DNA EMBEDDING

To incorporate richer contextual information, the DNA LM (Ji et al., 2021) adopts the k-mer sequence representation, which is widely used in DNA sequence analysis. For example, the sequence ATGGCT has the 3-mer representation {ATG, TGG, GGC, GCT}. In this paper, we adopt the 3-mer representation and compute the probability of a 3-mer token by multiplying the probabilities of its three individual bases. The 3-mer representation is then fed to the pre-trained DNA LM.
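A small sketch of how per-base probabilities can be combined into 3-mer token probabilities for the DNA LM; the vocabulary construction and function names are illustrative assumptions.

```python
import itertools
import torch
import torch.nn.functional as F

BASES = "ACGT"
KMER_VOCAB = ["".join(p) for p in itertools.product(BASES, repeat=3)]  # 64 3-mers

def kmer_probs(X):
    """X: (L, 4) logits over bases; returns (L-2, 64) probabilities over 3-mer tokens."""
    p = F.softmax(X, dim=-1)                                   # per-base probabilities
    probs = []
    for kmer in KMER_VOCAB:
        idx = [BASES.index(b) for b in kmer]
        # P(3-mer starting at position l) = product of the three base probabilities.
        probs.append(p[:-2, idx[0]] * p[1:-1, idx[1]] * p[2:, idx[2]])
    return torch.stack(probs, dim=-1)
```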

A.2 DATASET DETAILS

We conduct experiments on two DNA tasks following (Chen et al., 2022) and three protein tasks from (Ren et al., 2022) with the most data points. We report the dataset details in Table 4.

DNA Task 1: TFBind8(r). The goal is to find a length-8 DNA sequence that maximizes the binding activity score with a particular transcription factor, SIX6REFR1 (Barrera et al., 2016). We sample 5000 data points for the offline algorithms following (Chen et al., 2022).

DNA Task 2: TFBind10(r). This task is the same as TFBind8(r) except that the goal is to find a length-10 DNA sequence. Both DNA tasks measure the entire search space, and we adopt these measurements as the approximate ground-truth evaluation.

Protein Task 1: avGFP. This task aims to find a protein sequence with approximately 239 amino acids that maximizes the fluorescence level of green fluorescent proteins (Sarkisyan et al., 2016). The task oracle is constructed using the full unobserved dataset (around 52,000 points) following (Ren et al., 2022). The oracle passes the average of the residue embeddings from the pre-trained Prot-T5 (Elnaggar et al., 2021) into a linear layer and then fits the dataset; the following two task oracles take the same form. The offline algorithms can only access the lowest-scoring 26,000 data points.

Protein Task 2: AAV. The goal is to engineer a 28-amino-acid segment (positions 561-588) of the VP1 protein to remain viable for gene therapy (Bryant et al., 2021). We use the entire 284,000 data points to build the oracle and the lowest-scoring 142,000 points for the offline algorithms.

Protein Task 3: E4B. This task aims to design a protein (around 102 amino acids) that maximizes the ubiquitination rate to the target protein (Starita et al., 2013). The full dataset of around 100,000 points is used to build the oracle, and the bottom half is used for the offline algorithms.

The parameterization of the oracle differs from that of the regression model in two respects: 1) model architecture and 2) pre-training data. First, the oracle adopts the Prot-T5 model, which consists of an encoder and a decoder, while the regression model adopts the Prot-BERT model, which only has an encoder. Second, Prot-T5 is trained on the BFD and UniRef100 datasets, whereas ProtBert is trained on the UniRef50 dataset. These two points demonstrate that the oracle and the regression model come from different function classes. We choose the Prot-T5 model as the oracle because it is the state-of-the-art protein LM for feature extraction, and recent work (Elnaggar et al., 2021) has demonstrated its effectiveness. To test how related the Prot-T5 (oracle) and Prot-BERT (proxy) models are, we trained both on a sampled training dataset and compared their predictions on a held-out test set. Evaluating the Pearson correlation coefficient (PCC) between the two prediction errors, PCC(ProtT5 predictions - test labels, ProtBERT predictions - test labels), we obtain -0.0053 on avGFP, -0.0005 on AAV, and -0.0062 on E4B. These results suggest that the two models are not strongly related in terms of the predictions they form.
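For reference, a one-line check of the error-correlation computation described above, using scipy; the prediction arrays here are random placeholders rather than the paper's actual model outputs.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder predictions and labels; in the paper these come from the Prot-T5
# oracle and the Prot-BERT proxy evaluated on the same held-out test split.
rng = np.random.default_rng(0)
test_labels = rng.normal(size=500)
prot_t5_preds = test_labels + rng.normal(scale=0.3, size=500)
prot_bert_preds = test_labels + rng.normal(scale=0.3, size=500)

pcc, _ = pearsonr(prot_t5_preds - test_labels, prot_bert_preds - test_labels)
print(f"PCC between prediction errors: {pcc:.4f}")
```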

Table 7: Mean squared prediction loss for the fine-tuned NN, the linearized pre-trained LM, and the NTK.

Method          TFBind8(r)       TFBind10(r)      avGFP            AAV               E4B
Finetuned NN    0.101 ± 0.001    1.130 ± 0.041    0.411 ± 0.197    5.148 ± 0.074     0.683 ± 0.012
Linearized NN   0.107 ± 0.000    1.618 ± 0.000    0.735 ± 0.000    23.041 ± 0.000    1.050 ± 0.000
NTK             0.111 ± 0.000    1.840 ± 0.000    0.807 ± 0.000    24.451 ± 0.000    1.075 ± 0.000

