PROTEIN SEQUENCE DESIGN IN A LATENT SPACE VIA MODEL-BASED REINFORCEMENT LEARNING

Anonymous

Abstract

Proteins are complex molecules responsible for diverse functions in the human body. Enhancing a protein's functionality and/or cellular fitness can significantly impact various industries. However, protein optimization remains challenging, and sequences generated by data-driven methods often fail in wet-lab experiments. This study investigates the limitations of existing model-based sequence design methods and presents a novel optimization framework that efficiently traverses a latent representation space instead of the protein sequence space. Our framework generates proteins with higher functionality and cellular fitness by modeling the sequence design task as a Markov decision process and applying model-based reinforcement learning. We present a comprehensive evaluation on two distinct proteins, GFP and His3, and analyze the predicted structures of optimized sequences using deep learning-based structure prediction.

1. INTRODUCTION

Proteins mediate the fundamental processes of cellular fitness and life. Iterated mutations on various proteins and natural selection during biological evolution diversify traits, eventually accumulating beneficial phenotypes. Similarly, in protein engineering and design, the directed evolution of proteins has proved to be an effective strategy for improving or altering proteins' functions or cellular fitness for industrial, research, and therapeutic applications (Yang et al., 2019; Huang et al., 2016). However, the protein sequence space of possible combinations of 20 amino acids is too large to search exhaustively in the laboratory, even with high-throughput screening of a diversified library (Huang et al., 2016). In other words, directed evolution becomes trapped at local fitness maxima when library diversification is insufficient to cross fitness valleys and access neighboring fitness peaks. Moreover, functional sequences in this vast space are rare and overwhelmed by nonfunctional sequences.

To tackle these limitations, data-driven methods have been applied to protein sequence design. These methods use reinforcement learning (RL) (Angermueller et al., 2019), Bayesian optimization (Wu et al., 2017; Belanger et al., 2019; Terayama et al., 2021; Stanton et al., 2022), and generative models (Jain et al., 2022; Kumar & Levine, 2020; Hoffman et al., 2022) in a model-based fashion, i.e., using a protein functionality predictor trained on experimental data to model the local landscape. Despite the advances obtained by these methods, it remains challenging to generate optimized sequences that are validated experimentally. We suggest that the cause is two-fold. The first cause is that the optimization process is usually performed by generating candidate sequences directly through amino acid substitutions (Belanger et al., 2019) or additions (Angermueller et al., 2019).
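To make the scale of the problem concrete, the following sketch (not from the paper; the sequence and function names are illustrative) computes the size of the sequence space for a modest protein length and shows the kind of single-substitution proposal that direct sequence-space methods rely on:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def search_space_size(length: int) -> int:
    """Number of distinct sequences of a given length: 20^length."""
    return len(AMINO_ACIDS) ** length

def random_substitution(seq: str, rng: random.Random) -> str:
    """Propose a candidate by mutating one position, as direct
    substitution-based search methods do."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]

# Even a short 100-residue protein has an astronomically large space,
# far beyond what any high-throughput screen can cover:
print(f"{search_space_size(100):.3e} possible 100-mers")  # prints 1.268e+130

rng = random.Random(0)
print(random_substitution("MKTAYIAKQR", rng))  # a one-mutation neighbor
```

Each proposal explores only an immediate neighbor in this space, which is why such walks tend to stall on local fitness maxima.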
Given the vast search space, these methodologies are highly computationally inefficient and commonly lead to exploring regions of the space with a low chance of containing functional proteins. Previous work on biological sequence design has explored optimizing over a learned latent representation space (Gómez-Bombarelli et al., 2018; Stanton et al., 2022). Similarly, in this paper we investigate the optimization of sequences via RL directly in a latent representation space rather than in the protein sequence space. Actions taken in this representation space, e.g., small perturbations of the latent vector, can intuitively be understood as walking through a local functionality/fitness landscape. The second cause is related to the models used as oracles for optimization. These models are trained on experimental data obtained for a specific function, covering only a small portion of the

