HOTPROTEIN: A NOVEL FRAMEWORK FOR PROTEIN THERMOSTABILITY PREDICTION AND EDITING

Abstract

The molecular basis of protein thermal stability is only partially understood and has major significance for drug and vaccine discovery. The lack of datasets and standardized benchmarks considerably limits learning-based discovery methods. We present HotProtein, a large-scale protein dataset with growth temperature annotations of thermostability, containing 182K amino acid sequences and 3K folded structures from 230 different species with a wide temperature range of -20 °C to 120 °C. Due to functional domain differences and data scarcity within each species, existing methods fail to generalize well on our dataset. We address this problem through a novel learning framework, consisting of (1) Protein structure-aware pre-training (SAP) which leverages 3D information to enhance sequence-based pre-training; (2) Factorized sparse tuning (FST) that utilizes low-rank and sparse priors as an implicit regularization, together with feature augmentations. Extensive empirical studies demonstrate that our framework improves thermostability prediction compared to other deep learning models. Finally, we introduce a novel editing algorithm to efficiently generate positive amino acid mutations that improve thermostability.

1. INTRODUCTION

Overheating unfolds and deactivates proteins. Proteins are the biopolymers responsible for executing most biological phenomena and, through evolution, have had their sequences optimized to carry out specific functions within specific cellular environments. A protein's stability is a multidimensional property that depends on a series of factors (Pucci et al., 2017; Cao et al., 2019) such as pH, salinity, and temperature (thermostability, shown in Figure 1), making it hard to adapt a protein to function outside of its endogenous cellular environment. Protein engineering is the field in which natural proteins are mutated to improve their stability in exogenous environments and their overall fitness for a particular function. One of the initial goals of most protein engineering campaigns is to improve the thermal stability of the protein (Haki & Rakshit, 2003; Bruins et al., 2001; Frokjaer & Otzen, 2005). Thermally stabilized proteins are more robust and therefore enable downstream applications in the food (Kapoor et al., 2017), biofuel (Huang et al., 2020), detergent (Von der Osten et al., 1993), chemical (Cho et al., 2015), and pharmaceutical (Amara, 2013) industries, as well as in drug design (De Carvalho, 2011; Mora & Telford, 2010) and bioremediation of environmental pollutants (Lu et al., 2022; Alcalde et al., 2006). Thus, to accelerate the engineering of a target protein, it is critical to understand and accurately predict the thermal stability changes caused by mutations. There has been a substantial effort from the community to quantitatively understand and model protein thermostability (e.g., Pucci et al., 2017; Cao et al., 2019; Pucci & Rooman, 2014; Li et al., 2019; Pucci & Rooman, 2017; Pouyan et al., 2022). However, the generalizability of these models is still unsatisfactory, and laborious experimental methods such as directed evolution are often preferred.



Figure 1: Overheating unfolds and deactivates proteins (Paci & Karplus, 2000).

