HOTPROTEIN: A NOVEL FRAMEWORK FOR PROTEIN THERMOSTABILITY PREDICTION AND EDITING

Abstract

The molecular basis of protein thermal stability is only partially understood and has major significance for drug and vaccine discovery. The lack of datasets and standardized benchmarks considerably limits learning-based discovery methods. We present HotProtein, a large-scale protein dataset with growth temperature annotations of thermostability, containing 182K amino acid sequences and 3K folded structures from 230 different species with a wide temperature range of -20°C to 120°C. Due to functional domain differences and data scarcity within each species, existing methods fail to generalize well on our dataset. We address this problem through a novel learning framework, consisting of (1) protein structure-aware pre-training (SAP), which leverages 3D information to enhance sequence-based pre-training; and (2) factorized sparse tuning (FST), which utilizes low-rank and sparse priors as an implicit regularization, together with feature augmentations. Extensive empirical studies demonstrate that our framework improves thermostability prediction compared to other deep learning models. Finally, we introduce a novel editing algorithm to efficiently generate positive amino acid mutations that improve thermostability.

1. INTRODUCTION

Proteins are the biopolymers responsible for executing most biological phenomena and, through evolution, have had their sequences optimized to carry out specific functions within specific cellular environments. A protein's stability is a multidimensional property that depends on a series of factors (Pucci et al., 2017; Cao et al., 2019) such as pH, salinity, and temperature (thermostability, shown in Figure 1), making it hard to adapt a protein to function outside of its endogenous cellular environment. Protein engineering is the field in which natural proteins are mutated to improve their stability in exogenous environments and their overall fitness for a particular function. One of the initial goals of most engineering campaigns is to improve the thermal stability of the protein (Haki & Rakshit, 2003; Bruins et al., 2001; Frokjaer & Otzen, 2005). Thermally stabilized proteins are more robust and therefore enable downstream applications in the food (Kapoor et al., 2017), biofuel (Huang et al., 2020), detergent (Von der Osten et al., 1993), chemical (Cho et al., 2015), and pharmaceutical (Amara, 2013) industries, as well as in drug design (De Carvalho, 2011; Mora & Telford, 2010) and bioremediation of environmental pollutants (Lu et al., 2022; Alcalde et al., 2006). Thus, to accelerate the engineering of a target protein, it is critical to understand and accurately predict the thermal stability changes caused by mutations. There has been a substantial effort from the community to quantitatively understand and model protein thermostability (e.g., Pucci et al., 2017; Cao et al., 2019; Pucci & Rooman, 2014; Li et al., 2019; Pucci & Rooman, 2017; Pouyan et al., 2022). However, the generalizability of these models is still unsatisfactory, and laborious experimental methods such as directed evolution are often preferred.
To enhance the capabilities of learning-based approaches, we present a large-scale, standardized protein benchmark, HotProtein, with organism-level temperature annotations, which serve as a lower bound on a protein's melting temperature (Jarzab et al., 2020). It consists of 182K protein sequences and 3K folded structures from 230 different species, covering a broad temperature range of -20°C to 120°C. However, consistent with Cao et al. (2019), deep models trained naively, even on our dataset, do not generalize to unseen proteins. The presumed reasons are (1) the considerable functional heterogeneity in proteins that arises from environmental conditions and evolutionary history and (2) the scarcity of high-quality experimental thermostability data due to the massive cost and labor required to generate such data. To tackle these pain points, we introduce a novel algorithmic pipeline to improve thermostability prediction. First, we enrich our sequence embeddings by infusing 3D structural information in a contrastive manner; we call this structure-aware pre-training (SAP). Then, we further fine-tune our model with a factorized sparse tuning (FST) approach. Here, we utilize factorized low-rank and sparse priors as implicit regularizers and leverage feature augmentations, such as mix-up (Verma et al., 2019) and worst-case augmentations (Chen et al., 2021d). FST greatly boosts the performance of tuned predictors, suggesting improved data efficiency and robustness against domain shifts (Li et al., 2022b; Chen et al., 2021c). Extensive evaluations on both HotProtein and an existing protein dataset, FireProtDB (Stourac et al., 2021), verify the effectiveness of our proposals.
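To make the phrase "factorized low-rank and sparse priors" concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the tuned weights take the form W = W0 + U V^T + S, where W0 is a frozen pre-trained matrix, U and V are thin rank-r factors, and S keeps only a few non-zero entries. The function names and the magnitude-based top-k rule for choosing the support of S are illustrative assumptions; a practical implementation would use a tensor library rather than nested lists.

```python
def low_rank_product(u, v):
    """Return U @ V^T for U (m x r) and V (n x r) given as nested lists."""
    return [
        [sum(u_row[k] * v_row[k] for k in range(len(u_row))) for v_row in v]
        for u_row in u
    ]


def top_k_sparse(values, k):
    """Keep the k largest-magnitude entries of a matrix and zero out the rest.

    Assumption: the sparse support is chosen by magnitude; this is only one
    of several possible selection rules.
    """
    ranked = sorted(
        ((abs(x), i, j) for i, row in enumerate(values) for j, x in enumerate(row)),
        reverse=True,
    )
    keep = {(i, j) for _, i, j in ranked[:k]}
    return [
        [x if (i, j) in keep else 0.0 for j, x in enumerate(row)]
        for i, row in enumerate(values)
    ]


def factorized_sparse_weights(w0, u, v, s):
    """Effective weights W = W0 + U V^T + S, with W0 left untouched."""
    low_rank = low_rank_product(u, v)
    return [
        [w0[i][j] + low_rank[i][j] + s[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


def trainable_parameter_count(m, n, rank, k):
    """Number of values updated during tuning: both thin factors plus k sparse entries."""
    return rank * (m + n) + k
```

Under these assumptions, a hypothetical 1280 x 1280 layer tuned with rank 4 and 64 sparse entries would update `trainable_parameter_count(1280, 1280, 4, 64)` = 10,304 values instead of 1,638,400, which is one way to picture why such a constraint can act as a regularizer when labeled data are scarce.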
Finally, to identify the top mutational predictions likely to improve thermal stability for a target protein, we develop a new optimization-based editing framework on top of a classifier or regressor that attempts to mimic the process of directed evolution while limiting the stochasticity (Pucci & Rooman, 2014). Unlike existing protein engineering approaches (Eijsink et al., 2005; Couñago et al., 2006; Wijma et al., 2013) that directly utilize the predictions to generate mutational designs, our proposal optimizes the model's objective toward a more thermostable label in order to identify mutated input sequences. Our contributions can be summarized as follows:
⋆ We collect and present a large-scale protein dataset, HotProtein, with organism-level temperature annotations. We use the organism's environmental growth temperature to label and classify all proteins within each organism, and we use these labels for thermostability prediction and editing. The dataset contains 182K amino acid sequences and 3K folded 3D structures of proteins from 230 different species, covering five thermostability types: Cryophilic, Psychrophilic, Mesophilic, Thermophilic, and Hyperthermophilic.
⋆ We introduce protein structure-aware pre-training, which injects 3D structural information into sequence embeddings in a contrastive fashion. It enhances the diversity and expressivity of the protein representations, resulting in improved thermostability prediction performance.
⋆ We introduce a robust and data-efficient tuning framework that performs weight updates in a factorized and sparse subspace together with augmented feature embeddings. This leads to substantial performance improvements under data scarcity and severe distribution shifts.
⋆ We formulate the search for thermally stabilizing mutations as an optimization problem: for a target protein and a trained predictor, we customize an editing framework that optimizes the input protein sequence to identify thermostabilizing mutations.
⋆ Extensive experiments conducted on both thermostability prediction and protein editing tasks consistently demonstrate the superiority of our proposals over various existing approaches (Rives et al., 2021). For example, when fine-tuned on the experimentally determined T_m dataset FireProtDB, our editing suggester achieves 53.93% (↑ 8.96%) precision in positive mutation classification, 50.79 (↑ 6.54) Spearman ρ correlation coefficient in temperature regression, and a 54.24% (↑ 1.83%) success rate in generating positive single mutations.
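The idea of building an editing step on top of a trained predictor, listed among the contributions above, can be pictured with a deliberately simplified sketch. It swaps the optimization-based editing described in this paper for a plain exhaustive scan over single substitutions, so it illustrates only the general "score candidates with the predictor, keep the improvements" pattern. The names `single_mutants`, `suggest_mutations`, and `toy_predictor` are hypothetical, and the toy scoring rule has no biological meaning.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_mutants(sequence):
    """Yield (position, new_residue, mutated_sequence) for every single substitution."""
    for pos, old in enumerate(sequence):
        for new in AMINO_ACIDS:
            if new != old:
                yield pos, new, sequence[:pos] + new + sequence[pos + 1:]


def suggest_mutations(sequence, predictor, top_k=3):
    """Rank single substitutions by predicted gain over the unmodified sequence.

    `predictor` is any callable mapping a sequence to a score in which a
    higher value means "predicted to be more thermostable".
    """
    base = predictor(sequence)
    scored = [
        (predictor(mutant) - base, pos, new)
        for pos, new, mutant in single_mutants(sequence)
    ]
    # Keep only substitutions predicted to improve on the starting sequence.
    gains = [item for item in scored if item[0] > 0]
    # Largest gain first; ties broken by position, then residue letter.
    gains.sort(key=lambda item: (-item[0], item[1], item[2]))
    return gains[:top_k]


def toy_predictor(sequence):
    """Hypothetical stand-in for a trained model: an arbitrary score, not biology."""
    return sequence.count("W")
```

Calling `suggest_mutations("MKG", toy_predictor)` returns a list of `(gain, position, residue)` tuples. In practice `toy_predictor` would be replaced by a trained classifier or regressor, and the exhaustive scan by a more efficient search over the input.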

2. RELATED WORKS

Protein Thermostability Prediction. When engineering a protein for greater stability, ∆∆G and ∆T_m are the metrics commonly used by molecular biologists, enzymologists, and protein engineers. ∆∆G measures the change in free energy between a protein and a mutated variant, while ∆T_m measures the change in thermal tolerance between two protein variants. The two are related through the van 't Hoff equation (Wright et al., 2017), and it is common to obtain ∆∆G from T_m measurements (e.g., Chen et al., 2013; Capriotti et al., 2005; Rodrigues et al., 2018; Pires et al., 2014a; Parthiban et al., 2006).
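For reference, the two quantities can be written out as follows. This is a sketch under stated assumptions rather than the exact formulation used in the works cited above. It assumes that ∆G denotes the unfolding free energy (sign conventions differ between datasets), that "wt" and "mut" denote the wild type and the variant, and that ∆H_m is the unfolding enthalpy at the melting temperature. The last line is a commonly used first-order approximation that holds only for small ∆T_m and similar unfolding entropies in both variants.

```latex
\begin{align}
\Delta\Delta G &= \Delta G_{\mathrm{mut}} - \Delta G_{\mathrm{wt}}, \\
\Delta T_m &= T_m^{\mathrm{mut}} - T_m^{\mathrm{wt}}, \\
\Delta\Delta G\big(T_m^{\mathrm{wt}}\big) &\approx \frac{\Delta H_m}{T_m^{\mathrm{wt}}}\,\Delta T_m .
\end{align}
```

With this convention, a stabilizing mutation has both ∆∆G > 0 and ∆T_m > 0.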



Figure 1: Overheat unfolds and deactivates proteins (Paci & Karplus, 2000).

availability

//github.com/VITA-Group/HotProtein.

