PROTFIM: FILL-IN-MIDDLE PROTEIN SEQUENCE DESIGN VIA PROTEIN LANGUAGE MODELS

Abstract

Since a protein's sequence determines its structure and function, engineering protein sequences allows us to optimize proteins for specific purposes, such as enhancing catalytic activity or maturing binding affinity. In protein engineering, the amino acids in the middle of a protein sequence are often changed while the remaining residues are kept fixed to avoid unwanted functional changes. However, existing research on protein sequence design via protein language models (pLMs) has focused on either modifying suffix residues by prompting the model with prefix residues or mutating residues across the whole sequence. This is unsuitable for scenarios where residues located in the middle of the sequence are to be optimized. In this work, we propose a pLM-based framework for fill-in-middle (FIM) protein engineering tasks. To evaluate the performance of pLMs on FIM tasks, we design a novel evaluation scheme in which pLMs are tasked with generating new sequences while maintaining secondary structure. We also propose a new PROTein language model specialized for the Fill-In-Middle task, ProtFIM. Experiments confirm that ProtFIM performs FIM engineering efficiently, especially for alpha-helix structures, and provides decent protein representations of sequence-function relationships. Finally, we demonstrate an artificial protein sequence design framework composed of ProtFIM and a high-quality structure predictor as a novel tool for optimizing protein sequences.

1. INTRODUCTION

Proteins play a crucial role in many biological processes, and the ensemble of diverse functioning proteins is the basis of life's activities, such as immune response and metabolism. These essential and versatile functions are encoded in protein sequences, the arrangements of amino acid residues. The sequences determine protein structures via complex biophysical interactions between residues, and these structures are directly linked to protein functions. Thus, optimizing a protein's function by changing the amino acid residues of a protein of interest, called protein engineering, has been of great interest in diverse industries such as biofuel (Wen et al., 2009), pharmaceuticals (H Tobin et al., 2014), and agriculture (Rao, 2008). A representative protein sequence design method is mutagenesis, which yields evolutionarily plausible candidate protein sequence libraries with the help of genetic engineering (Arnold, 1998). However, this approach requires substantial effort in high-throughput screening experiments. Recently, machine learning-guided protein sequence design strategies have been proposed to search sequence space more efficiently using experimentally acquired labeled data (Yang et al., 2019a). With advances in both high-throughput sequencing technologies and language modeling in natural language processing (NLP), protein language models (pLMs), which are trained in an unsupervised manner on tremendous sets of unlabeled protein sequences (Consortium, 2019), have been developed for generating de novo protein sequences (Madani et al., 2020; Hesslow et al., 2022; Moffat et al., 2022; Ferruz et al., 2022; Nijkamp et al., 2022). Existing generative pLMs are trained with an auto-regressive (AR) strategy (Radford et al., 2019; Brown et al., 2020) and generate sequences conditioned on prefix protein sequences.
Unfortunately, if the target region whose amino acid residues we want to change is located near the front of the sequence, existing pLMs can use only a few preceding amino acid residues ("prompts") for sequence generation. Interaction sites, the positions that interact with other proteins or molecules to perform a protein's function and are the main targets of functional modification, are distributed evenly along the protein sequence. To verify this, we collect 3D protein structures from the Protein Data Bank (PDB) (Sussman et al., 1998) and calculate the relative positions of protein-protein interaction sites on the protein sequences (see details in Appendix A.1). As illustrated in Figure 5, interaction sites are evenly present along the protein sequence. This suggests that, in many protein engineering cases, the amino acid sequence will be modified in the middle of the sequence. In such cases, existing pLMs cannot effectively use the information following the target region, which can result in poor generation quality. In this work, we cast middle-region protein engineering as a fill-in-middle (FIM) sequence generation problem, as in Figure 1, and investigate the applicability of pLMs to a FIM protein engineering framework. With the emergence of highly accurate protein structure predictors (Jumper et al., 2021; Baek et al., 2021), protein structures can now be predicted quickly, accurately, and at low cost. Leveraging these advances, we propose a new evaluation scheme for FIM protein sequence generation: Secondary structurE InFilling rEcoveRy (SEIFER). Secondary structures are usually desirable to preserve (Rubio et al., 2019), since the binding pockets of interacting proteins or molecules are fixed to some extent. In SEIFER, models are tasked with recommending protein sequences that satisfy two conditions: the new sequences must differ from the original sequences, and their secondary structures must be fully maintained.
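The two SEIFER conditions can be sketched as a pair of predicates on a candidate redesign: the infilled span must differ from the original residues, and the per-residue secondary-structure string of the full redesigned sequence must match that of the original. A minimal sketch, assuming a `predict_ss` callable that stands in for whatever structure-prediction pipeline assigns secondary structure (the function name and interface are illustrative, not part of the paper's implementation):

```python
def seifer_pass(orig_seq, new_seq, start, end, predict_ss):
    """Return True if new_seq is a valid SEIFER redesign of orig_seq[start:end].

    predict_ss: callable mapping a sequence to its per-residue secondary
    structure string (e.g. 'H' helix, 'E' strand, 'C' coil); a stand-in
    for a structure predictor plus secondary-structure assignment.
    """
    # Infilling replaces only the middle span, so lengths and flanks match.
    assert len(orig_seq) == len(new_seq)
    assert orig_seq[:start] == new_seq[:start] and orig_seq[end:] == new_seq[end:]
    # Condition 1: the redesigned middle span must differ from the original.
    if new_seq[start:end] == orig_seq[start:end]:
        return False
    # Condition 2: the secondary structure must be fully preserved.
    return predict_ss(new_seq) == predict_ss(orig_seq)
```

In practice the expensive part is `predict_ss`; the fast, accurate structure predictors mentioned above are what make running this check over many generated candidates feasible.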
Thus, SEIFER assesses both the diversity and the structure of new sequences simultaneously, and we believe it is well suited for evaluating generated sequences in protein engineering, which modifies the amino acid residues of original sequences to improve function. Also, inspired by the latest research on language models (Bavarian et al., 2022b), we propose a new Protein language model specialized for the Fill-In-Middle task, ProtFIM. Compared to existing pLMs, ProtFIM uses both front ("prefix") and back ("suffix") sequence information during training and inference. Through SEIFER evaluation, we show that ProtFIM can generate diverse sequences while maintaining secondary structure, especially for α-helices. Furthermore, ProtFIM outperforms existing pLMs when engineering residues positioned in the front part of a protein sequence, showing that FIM training is better suited to FIM engineering than AR training. Finally, through analysis and visualization, we show that ProtFIM learns decent representations of protein sequences and can serve as a sequence optimization tool when paired with AlphaFold2. In summary, our contributions are:

• We define FIM protein engineering as a protein sequence infilling task and demonstrate the applicability of protein language models to it.

• We propose a new evaluation scheme, SEIFER, that evaluates pLMs on protein infilling sequence design tasks by considering structural conservation. Through this evaluation, we find that existing AR pLMs are capable of designing sequences with α-helix structure.

• We propose a new type of pLM, ProtFIM, that has both AR and FIM capability. Comprehensive results show that ProtFIM achieves efficient and comparable performance in protein infilling and in protein representations for protein engineering compared to other pLMs.
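The FIM training scheme referenced above (Bavarian et al., 2022b) amounts to a simple data transformation: each sequence is split into prefix, middle, and suffix spans and rearranged with sentinel tokens so that an ordinary autoregressive model learns to generate the middle conditioned on both flanks. A minimal sketch of that transformation on a protein sequence; the sentinel token names are illustrative placeholders, not the actual ProtFIM vocabulary:

```python
import random

def fim_transform(seq, rng,
                  pre_tok="<PRE>", mid_tok="<MID>", suf_tok="<SUF>"):
    """Rearrange a sequence into prefix-suffix-middle (PSM) order.

    Training an AR model on the transformed string means that tokens
    generated after <MID> infill the middle span, conditioned on both
    the prefix and the suffix. Sentinel names are placeholders.
    """
    # Pick two random cut points, defining prefix / middle / suffix spans.
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    prefix, middle, suffix = seq[:i], seq[i:j], seq[j:]
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}{middle}"

print(fim_transform("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", random.Random(0)))
```

At inference time, the prompt is built the same way up to `<MID>`, and the model's continuation is spliced back between the prefix and suffix to form the full redesigned sequence.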



Figure 1: An illustrative example of FIM protein engineering. The changed sequences for the target region are generated by generative pLMs or the like, and the structures are altered accordingly.

