ROTAMER DENSITY ESTIMATOR IS AN UNSUPERVISED LEARNER OF THE EFFECT OF MUTATIONS ON PROTEIN-PROTEIN INTERACTION

Abstract

Protein-protein interactions are crucial to many biological processes, and predicting the effect of amino acid mutations on binding is important for protein engineering. While data-driven approaches using deep learning have shown promise, the scarcity of annotated experimental data remains a major challenge. In this work, we propose a new approach that predicts mutational effects on binding using the change in conformational flexibility of the protein-protein interface. Our approach, named Rotamer Density Estimator (RDE), employs a flow-based generative model to estimate the probability distribution of protein side-chain conformations and uses entropy to measure flexibility. RDE is trained solely on protein structures and does not require the supervision of experimental values of changes in binding affinities. Furthermore, the unsupervised representations extracted by RDE can be used for downstream neural network predictions with even greater accuracy. Our method outperforms empirical energy functions and other machine learning-based approaches.

1. INTRODUCTION

Proteins rarely act alone and usually interact with other proteins to perform a diverse range of biological functions (Alberts & Miake-Lye, 1992; Kastritis & Bonvin, 2013). For example, antibodies, a type of immune system protein, recognize and bind to proteins on pathogens' surfaces, eliciting immune responses by interacting with the receptor protein of immune cells (Lu et al., 2018). Given the importance of protein-protein interactions in many biological processes, developing methods to modulate these interactions is critical. A common strategy to modulate protein-protein interactions is to mutate amino acids on the interface: some mutations enhance the strength of binding, while others weaken or even disrupt the interaction (Gram et al., 1992; Barderas et al., 2008). Biologists may choose to increase or decrease binding strength depending on their specific goals. For example, enhancing the effect of a neutralizing antibody against a virus usually requires increasing the binding strength between the antibody and the viral protein. However, the combinatorial space of amino acid mutations is large, so it is not always feasible or affordable to conduct wet-lab assays to test all viable mutations. Therefore, computational approaches are needed to guide the identification of desirable mutations by predicting their effects on binding strength, typically measured by the change in binding free energy (∆∆G).

Traditional computational approaches are mainly based on biophysics and statistics (Schymkowitz et al., 2005; Park et al., 2016; Alford et al., 2017). Although these methods have dominated the field for years, they have several limitations. Biophysics-based methods face a trade-off between efficiency and accuracy since they rely on sampling from energy functions. Statistical methods are more efficient, but their capacity is limited by the descriptors considered in the model. Furthermore, both biophysics-based and statistics-based methods rely heavily on human knowledge, which prevents them from benefiting from the growing availability of protein structures. As a result, predicting the effects of mutations on protein-protein binding remains an open problem.

Recently, deep learning has shown significant promise in modeling proteins, making data-driven approaches more attractive than ever (Rives et al., 2019; Jumper et al., 2021). However, developing deep learning-based models to predict mutational effects on protein-protein binding is challenging due to the scarcity of experimental data. Only a few thousand protein mutations annotated with changes in binding affinity are publicly available (Geng et al., 2019b), so supervised models are prone to overfitting. Another difficulty is the absence of structures for mutated protein-protein complexes. Mutating amino acids on a protein complex leads to changes mainly in sidechain conformations (Najmanovich et al., 2000; Gaudreault et al., 2012), which contribute to the change in binding free energy. However, the exact conformational changes upon mutation are unknown.

In this work, we draw inspiration from the thermodynamic principle that protein-protein binding usually leads to entropy loss on the binding interface, which can be used to determine binding affinity (Brady & Sharp, 1997; Cole & Warwicker, 2002; Kastritis & Bonvin, 2013). When two proteins bind, the residues located at the interface tend to become less flexible (i.e., to have lower entropy) due to the physical and geometric constraints imposed by the binding partner (Figure 1). A greater entropy loss corresponds to a stronger binding affinity. Therefore, by comparing the entropy losses of wild-type and mutated protein complexes, we can estimate the effect of mutations on binding affinity. Please refer to Section B in the appendix for a detailed discussion.
Based on this principle, we introduce a novel approach to predict the impact of amino acid mutations on protein-protein interaction. The core of our method is Rotamer Density Estimator (RDE), a conditional generative model that estimates the density of amino acid sidechain conformations (rotamers). We use the entropy of the estimated density as a metric of conformational flexibility. By subtracting the entropy of the separated proteins from the entropy of the complex, we obtain an estimate of binding affinity. Finally, we can assess the effect of mutations by comparing the estimated binding affinities of wild-type and mutant protein complexes. In addition to directly comparing entropy, we also employ neural networks to predict ∆∆G from the representations learned by RDE.

Our method is an attempt to address the aforementioned challenges. Rotamer Density Estimator is trained solely on protein structures and requires no other labels, making it an unsupervised learner of the mutation effect on protein-protein interaction. This feature mitigates the challenge posed by the scarcity of annotated mutation data. Moreover, our method does not require the mutated protein structure as input. Instead, it treats mutated structures as latent variables, which are approximated by RDE. Our method outperforms both empirical energy functions and machine learning models for predicting ∆∆G. Additionally, as a generative model for rotamers, RDE accurately predicts sidechain conformations.
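
To make the entropy comparison described above concrete, the following is a minimal toy sketch and not the RDE implementation. It assumes a single torsional angle whose density is a von Mises distribution standing in for the learned flow-based density; the concentration values, the function names (`log_bessel_i0`, `mc_entropy`, `entropy_change_on_binding`), and the reading of the final difference as an uncalibrated ∆∆G proxy are all illustrative assumptions. Entropy is estimated by Monte Carlo as the negative mean log-density of samples drawn from the density itself.

```python
import math
import random


def log_bessel_i0(kappa, terms=60):
    """Log of the modified Bessel function I0, from its power series."""
    total, term = 0.0, 1.0  # term starts at the k = 0 summand
    for k in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / ((k + 1) ** 2)
    return math.log(total)


def mc_entropy(kappa, n_samples=20000, seed=0):
    """Monte Carlo estimate of H = -E[log p(x)] for a von Mises density.

    The von Mises density is a hypothetical stand-in for a learned
    rotamer density; higher kappa means a more rigid sidechain.
    """
    rng = random.Random(seed)
    log_norm = math.log(2.0 * math.pi) + log_bessel_i0(kappa)
    total = 0.0
    for _ in range(n_samples):
        x = rng.vonmisesvariate(0.0, kappa)      # sample one torsional angle
        total += kappa * math.cos(x) - log_norm  # log p(x) at that sample
    return -total / n_samples


def entropy_change_on_binding(kappa_bound, kappa_unbound):
    """Entropy of the complex minus entropy of the separated proteins."""
    return mc_entropy(kappa_bound) - mc_entropy(kappa_unbound)


# Illustrative concentrations (assumed): binding rigidifies the residue,
# and the toy mutant is rigidified less than the wild type.
wild_type = entropy_change_on_binding(kappa_bound=8.0, kappa_unbound=2.0)
mutant = entropy_change_on_binding(kappa_bound=4.0, kappa_unbound=2.0)

# Assumed convention: a positive value means the mutant loses less
# entropy on binding, read here as weaker binding than the wild type.
ddg_proxy = mutant - wild_type
```

In this toy setting both entropy changes are negative, since binding reduces flexibility, and the mutant's smaller entropy loss yields a positive proxy. Mapping such a difference onto actual ∆∆G values would require a calibration that this sketch does not attempt.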

2.1. MUTATIONAL EFFECT PREDICTION FOR PROTEIN-PROTEIN INTERACTION

Traditional approaches to predicting the effect of mutation on protein binding can be roughly divided into two classes: biophysical and statistical methods. Biophysical methods utilize energy functions to model inter-atomic interactions. These methods sample conformations of the mutated protein complex and predict changes in binding free energy (Schymkowitz et al., 2005; Park et al., 2016; Alford et al., 2017; Steinbrecher et al., 2017). Statistical methods rely on feature engineering, which uses descriptors summarizing geometric, physical, evolutionary, and motif properties of proteins to predict mutational effects (Geng et al., 2019a; Zhang et al., 2020). Traditional methods face the



Figure 1: The conformational flexibility of the interface generally decreases upon binding.

