PROTEIN SEQUENCE DESIGN IN A LATENT SPACE VIA MODEL-BASED REINFORCEMENT LEARNING

Anonymous

Abstract

Proteins are complex molecules responsible for diverse functions in the human body. Enhancing a protein's functionality and/or cellular fitness can significantly impact various industries. However, protein optimization remains challenging, and sequences generated by data-driven methods often fail in wet-lab experiments. This study investigates the limitations of existing model-based sequence design methods and presents a novel optimization framework that efficiently traverses a latent representation space instead of the protein sequence space. Our framework generates proteins with higher functionality and cellular fitness by modeling the sequence design task as a Markov decision process and applying model-based reinforcement learning. We present a comprehensive evaluation on two distinct proteins, GFP and His3, and analyze the predicted structures of optimized sequences using deep learning-based structure prediction.

1. INTRODUCTION

Proteins mediate the fundamental processes of cellular fitness and life. Iterated mutation of various proteins and natural selection during biological evolution diversify traits, eventually accumulating beneficial phenotypes. Similarly, in protein engineering and design, the directed evolution of proteins has proved an effective strategy for improving or altering proteins' functions or cellular fitness for industrial, research, and therapeutic applications (Yang et al., 2019; Huang et al., 2016). However, the protein sequence space of possible combinations of 20 amino acids is far too large to search exhaustively in the laboratory, even with high-throughput screening of a diversified library (Huang et al., 2016). In other words, directed evolution becomes trapped at local fitness maxima where library diversification is insufficient to cross fitness valleys and access neighboring fitness peaks. Moreover, functional sequences in this vast space are rare and overwhelmed by nonfunctional ones.

To tackle these limitations, data-driven methods have been applied to protein sequence design, using reinforcement learning (RL) (Angermueller et al., 2019), Bayesian optimization (Wu et al., 2017; Belanger et al., 2019; Terayama et al., 2021; Stanton et al., 2022), and generative models (Jain et al., 2022; Kumar & Levine, 2020; Hoffman et al., 2022) in a model-based fashion, i.e., using a protein functionality predictor trained on experimental data to model the local landscape. Despite the advances obtained by these methods, it remains challenging to generate optimized sequences that are experimentally validated. We suggest that the cause is two-fold. The first cause is that the optimization process is usually performed by generating candidate sequences directly through amino acid substitutions (Belanger et al., 2019) or additions (Angermueller et al., 2019).
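To make the scale concrete, a back-of-the-envelope calculation (illustrative, not from the paper) of how the combinatorial sequence space grows with length:

```python
# Illustrative arithmetic: the number of possible sequences of length L
# over the 20 standard amino acids grows as 20**L, which quickly dwarfs
# what any laboratory screen can cover.
def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    return alphabet_size ** length

# Even a short 50-residue protein admits on the order of 1e65 sequences.
print(f"{sequence_space_size(50):.2e}")  # -> 1.13e+65
```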
Given the vast search space, these methodologies are highly computationally inefficient and commonly lead to the exploration of regions with a low chance of containing functional proteins. In designing biological sequences, previous literature has explored optimizing over a learned latent representation space (Gómez-Bombarelli et al., 2018; Stanton et al., 2022). Similarly, in this paper we investigate the optimization of sequences via RL directly in a latent representation space rather than in the protein sequence space. Actions, e.g., small perturbations of the latent vector, taken in this representation space can intuitively be understood as walking through a local functionality/fitness landscape.

Figure 1: Overview of the framework: (top left) the encoder-decoder architecture trained to represent protein sequences in a latent space and (top right) the RL framework. The state is the representation in the latent space, while the action is a perturbation of this representation. The perturbed representation is decoded back to a protein sequence using a sequence decoder, and the reward is based on the functionality predicted by the functionality predictor (oracle). The bottom row shows three RL-based state and action modeling options.

The second cause relates to the models used as an oracle for optimization. These models are trained on experimental data obtained for a specific function, covering only a small portion of the protein space, and their accuracy is consequently restricted to this small region. Later, we will demonstrate that even the most advanced model-based optimizations can assign high functionality values to randomly generated protein sequences. These random sequences, however, are unlikely to be functional. To reduce false positives, we suggest augmenting the experimental data with random sequences (i.e., negative examples) assigned a low functionality value.
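The negative-example augmentation can be sketched as follows; `AMINO_ACIDS`, `augment_with_negatives`, and the labeled tuples are illustrative names, and the paper's actual augmentation procedure may differ in its details:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def augment_with_negatives(data, n_negatives, seq_len, low_value=0.0, seed=0):
    """Append randomly generated sequences labeled with a low functionality
    value, so the oracle learns to distrust regions far from the data."""
    rng = random.Random(seed)
    negatives = [
        ("".join(rng.choice(AMINO_ACIDS) for _ in range(seq_len)), low_value)
        for _ in range(n_negatives)
    ]
    return data + negatives

train = [("MSKGEELFTG", 0.92)]  # (sequence, measured functionality)
train = augment_with_negatives(train, n_negatives=3, seq_len=10)
print(len(train))  # 4 examples: 1 measured + 3 random negatives
```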
Such augmentation also helps set boundaries around the experimental data distribution within which the oracle can be trusted.

We model protein sequence design as a Markov Decision Process (MDP) to optimize a latent representation. Our method trains the optimization policy using a model-based deep reinforcement learning (RL) framework (Fig. 1). At each timestep, the policy updates the latent representation by small perturbations to maximize protein functionality or cellular fitness, i.e., walking uphill through the local landscape until the end of an episode. We evaluate our method on two tasks: optimizing the functionality of the green fluorescent protein (GFP) and the cellular fitness of imidazoleglycerol-phosphate dehydratase (His3). Our results show that the proposed framework designs sequences with higher protein functionality and cellular fitness than existing methods. Ablation studies over various state and action modeling options for the RL framework show that the proposed latent-representation update can successfully optimize the protein and search the vast design space. We provide visual evidence that the trained policy can traverse the local functionality landscape efficiently for GFP and His3. Our method is general and can also be applied to representations learned from protein structures, such as those presented in Zhang et al. (2022) and Eguchi et al. (2022).
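A minimal sketch of this MDP as an environment, assuming gym-style `reset`/`step` semantics; the class, the clipping scheme, and all names (`horizon`, `max_step`, the `decoder`/`oracle` callables) are illustrative, not the paper's implementation:

```python
import numpy as np

class LatentDesignEnv:
    """Sketch of the MDP described above:
    state  = latent representation z of a sequence,
    action = small perturbation dz of that representation,
    reward = oracle-predicted functionality of the decoded sequence."""

    def __init__(self, decoder, oracle, z0, horizon=10, max_step=0.1):
        self.decoder, self.oracle = decoder, oracle
        self.z0, self.horizon, self.max_step = z0, horizon, max_step
        self.z, self.t = z0.copy(), 0

    def reset(self):
        self.z, self.t = self.z0.copy(), 0
        return self.z

    def step(self, dz):
        # Clip the perturbation so the policy takes small steps, "walking
        # uphill" through the local landscape rather than jumping.
        dz = np.clip(dz, -self.max_step, self.max_step)
        self.z = self.z + dz
        self.t += 1
        seq = self.decoder(self.z)   # latent vector -> protein sequence
        reward = self.oracle(seq)    # predicted functionality / fitness
        done = self.t >= self.horizon
        return self.z, reward, done

# Toy usage with stub decoder and oracle (stand-ins for the trained models).
env = LatentDesignEnv(decoder=lambda z: "M" * z.size,
                      oracle=lambda s: float(len(s)),
                      z0=np.zeros(4), horizon=2)
z = env.reset()
z, r, done = env.step(np.ones(4))
print(r, done)  # 4.0 False
```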

2. RELATED WORK

Protein representation learning Representation learning methods aim to learn compact and expressive features describing data. Since a protein is composed of a sequence of amino acids drawn from an alphabet of 20 (N=20), it can be interpreted as a long word in which each character is an amino acid. Owing to this similarity with natural language processing (NLP) tasks, methods such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) have been used to train protein language models (Alley et al., 2019; Brandes et al., 2022; Lin et al., 2022; Ferruz et al., 2022; Rives et al., 2021). The protein language model trained on 250 million sequences by Rives et al. (2021) has been shown to learn representations containing meaningful information about biological properties and reflecting protein structure. It was shown that the learned representation generalizes across different applications, achiev-
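As a concrete illustration of treating a protein as a "word" over a 20-letter alphabet, here is a simple one-hot encoding, a toy alternative to the learned language-model embeddings discussed above (names are illustrative):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot matrix, treating
    each residue as a 'character' in the 20-letter alphabet."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AA_TO_IDX[aa]] = 1.0
    return x

x = one_hot_encode("MSKGE")
print(x.shape)  # (5, 20)
```

Learned embeddings replace each sparse one-hot row with a dense, context-dependent vector, which is what makes the latent space smooth enough to optimize over.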




