ROTAMER DENSITY ESTIMATOR IS AN UNSUPERVISED LEARNER OF THE EFFECT OF MUTATIONS ON PROTEIN-PROTEIN INTERACTION

Abstract

Protein-protein interactions are crucial to many biological processes, and predicting the effect of amino acid mutations on binding is important for protein engineering. While data-driven approaches using deep learning have shown promise, the scarcity of annotated experimental data remains a major challenge. In this work, we propose a new approach that predicts mutational effects on binding using the change in conformational flexibility of the protein-protein interface. Our approach, named Rotamer Density Estimator (RDE), employs a flow-based generative model to estimate the probability distribution of protein side-chain conformations and uses entropy to measure flexibility. RDE is trained solely on protein structures and does not require the supervision of experimental values of changes in binding affinities. Furthermore, the unsupervised representations extracted by RDE can be used for downstream neural network predictions with even greater accuracy. Our method outperforms empirical energy functions and other machine learning-based approaches.

1. INTRODUCTION

Proteins rarely act alone and usually interact with other proteins to perform a diverse range of biological functions (Alberts & Miake-Lye, 1992; Kastritis & Bonvin, 2013). For example, antibodies, a type of immune system protein, recognize and bind to proteins on pathogens' surfaces, eliciting immune responses by interacting with the receptor protein of immune cells (Lu et al., 2018). Given the importance of protein-protein interactions in many biological processes, developing methods to modulate these interactions is critical. A common strategy to modulate protein-protein interactions is to mutate amino acids on the interface: some mutations enhance the strength of binding, while others weaken or even disrupt the interaction (Gram et al., 1992; Barderas et al., 2008). Biologists may choose to increase or decrease binding strength depending on their specific goals. For example, enhancing the effect of a neutralizing antibody against a virus usually requires increasing the binding strength between the antibody and the viral protein. However, the combinatorial space of amino acid mutations is large, so it is not always feasible or affordable to conduct wet-lab assays to test all viable mutations. Therefore, computational approaches are needed to guide the identification of desirable mutations by predicting their mutational effects on binding strength, typically measured by the change in binding free energy (∆∆G).

Traditional computational approaches are mainly based on biophysics and statistics (Schymkowitz et al., 2005; Park et al., 2016; Alford et al., 2017). Although these methods have dominated the field for years, they have several limitations. Biophysics-based methods face a trade-off between efficiency and accuracy since they rely on sampling from energy functions. Statistical methods are more efficient, but their capacity is limited by the descriptors considered in the model.
Furthermore, both biophysics-based and statistics-based methods rely heavily on human knowledge, preventing it to

