PROTFIM: FILL-IN-MIDDLE PROTEIN SEQUENCE DESIGN VIA PROTEIN LANGUAGE MODELS

Abstract

Since a protein's sequence determines its structure and function, engineering protein sequences allows us to optimize proteins for specific purposes, such as enhancing catalytic activity or maturing binding affinity. In protein engineering, there are many cases where amino acids in the middle of a protein sequence are changed while the remaining residues are kept fixed to avoid unwanted functional changes. However, existing research on protein sequence design via protein language models (pLMs) has focused on modifying suffix residues by prompting the model with prefix residues, or on mutating residues across the whole sequence. This is unsuitable for scenarios where the residues in the middle of the sequence are to be optimized. In this work, we suggest a pLM-based framework to solve fill-in-middle (FIM) protein engineering tasks. To evaluate the performance of pLMs on FIM tasks, we design a novel evaluation scheme in which pLMs are tasked to generate new sequences while maintaining the secondary structures. We also propose a new PROTein language model specialized for the Fill-In-Middle task, ProtFIM. Experiments confirm that ProtFIM performs FIM engineering efficiently, especially for alpha-helix structures, and provides decent protein representations of sequence-function relationships. Finally, we demonstrate an artificial protein sequence design framework composed of ProtFIM and a high-quality structure predictor as a novel tool for optimizing protein sequences.

1. INTRODUCTION

Proteins play a crucial role in many biological processes, and the ensemble of diverse functioning proteins is the basis of life's activities, such as immune response and metabolism. These essential and versatile functions are encoded in protein sequences, which are arrangements of amino acid residues. The sequences determine protein structures via complex biophysical interactions between residues, and these structures are directly linked to the functions of proteins. Thus, optimizing a protein's function by changing amino acid residues of a protein of interest, called protein engineering, has been of great interest in diverse industries such as biofuel (Wen et al., 2009), pharmaceuticals (H Tobin et al., 2014), and agriculture (Rao, 2008). A representative protein sequence design method is mutagenesis, which yields evolutionarily plausible candidate protein sequence libraries with the help of genetic engineering (Arnold, 1998). However, this approach requires substantial effort in high-throughput screening experiments. Recently, machine learning-guided protein sequence design strategies have been proposed to search sequence space more efficiently using experimentally acquired labeled data (Yang et al., 2019a). With advances in both high-throughput sequencing technologies and language modeling in the field of natural language processing (NLP), protein language models (pLMs), which are trained in an unsupervised manner on tremendous sets of unlabeled protein sequences (Consortium, 2019), have been developed for generating de novo protein sequences (Madani et al., 2020; Hesslow et al., 2022; Moffat et al., 2022; Ferruz et al., 2022; Nijkamp et al., 2022). Existing generative pLMs are trained with an auto-regressive (AR) strategy (Radford et al., 2019; Brown et al., 2020) and generate sequences conditioned on prefix protein sequences.
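To make the contrast concrete, FIM training rearranges each sequence so that an auto-regressive model can condition on both flanks before generating the masked middle span. The sketch below illustrates this standard prefix-suffix-middle transformation; the sentinel tokens (`<PRE>`, `<SUF>`, `<MID>`) are hypothetical placeholders, as the actual special tokens depend on the model's tokenizer.

```python
def make_fim_example(sequence: str, start: int, end: int,
                     pre: str = "<PRE>", suf: str = "<SUF>",
                     mid: str = "<MID>") -> tuple:
    """Rearrange a protein sequence so an auto-regressive model sees
    both flanks as context and generates the middle span [start, end).

    Sentinel token strings here are illustrative assumptions, not the
    tokens used by any particular pLM.
    """
    prefix = sequence[:start]    # residues before the target region
    middle = sequence[start:end]  # the region to be (re)designed
    suffix = sequence[end:]       # residues after the target region
    # The model is prompted with prefix and suffix, then trained to
    # emit the middle last, so plain left-to-right decoding fills it in.
    prompt = f"{pre}{prefix}{suf}{suffix}{mid}"
    return prompt, middle


# Example: redesign residues 3-5 of a short (artificial) sequence.
prompt, target = make_fim_example("MKTAYIAKQR", 3, 6)
```

At inference time, the same prompt layout lets the model generate replacement residues for the middle region while every flanking residue stays untouched, which is exactly the constraint in the engineering scenarios described above.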
Unfortunately, if the target region where we want to change amino acid residues is located at the front, existing pLMs use only

