PREDICTION OF ENZYME SPECIFICITY USING PROTEIN GRAPH CONVOLUTIONAL NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Specific molecular recognition by proteins, for example protease enzymes, is critical for maintaining the robustness of key life processes. The substrate specificity landscape of a protease enzyme comprises the set of all sequence motifs that are recognized and cut or, just as importantly, not recognized and not cut by the enzyme. Current methods for predicting protease specificity landscapes rely on learning sequence patterns from experimental data collected with a single enzyme, and they are not robust to even small mutational changes in that enzyme. A comprehensive evaluation of specificity requires consideration of the three-dimensional structure and energetics of molecular interactions. In this work, we present a protein graph convolutional neural network (PGCN) that determines substrate specificity from a physically intuitive, structure-based molecular interaction graph. The graph is generated using the Rosetta energy function and encodes both the topology and the energetic features of the enzyme-substrate complex. We use the PGCN to recapitulate and predict the specificity of the NS3/4 protease from the hepatitis C virus. We compare our PGCN with previously used machine learning models and show that its performance in classification tasks is equivalent or better. Because the PGCN is based on physical interactions, it is inherently more interpretable: analysis of feature importance reveals key sub-graph patterns responsible for molecular recognition that are biochemically reasonable. The PGCN model also readily lends itself to the design of novel enzymes with tailored specificity against disease targets.

1. INTRODUCTION

Selective molecular recognition between biomolecules, e.g. in protein-protein, DNA-protein (Tainer & Cunningham, 1993), and protein-small molecule interactions, is key for maintaining the fidelity of life processes. Multispecificity, i.e. the specific recognition and non-recognition of multiple targets by biomolecules, is critical for many biological processes; for example, the selective recognition and cleavage of host and viral target sites by viral protease enzymes is critical for the life cycle of many RNA viruses, including SARS-CoV-2 (Vizovišek et al., 2018). Prospective prediction of the sequence motifs corresponding to protease enzyme target sites (substrates) is therefore an important goal with broad implications. Elucidating the target specificity of viral protease enzymes can guide the design of antiviral inhibitor drug candidates. The ability to accurately and efficiently model the landscape of protease specificity, i.e. the set of all substrate sequence motifs that are recognized (and not recognized) by a given enzyme and its variants, would also enable the design of proteases with tailored specificities that degrade chosen disease-related targets. Most current approaches for protease specificity prediction involve detecting and/or learning patterns in known substrate sequences using techniques ranging from logistic regression to deep learning. However, these black-box approaches do not provide any physical or chemical insight into the underlying basis for a particular specificity profile, nor are they robust to changes in the protease enzyme that often arise in the course of evolution. A comprehensive model of protease specificity requires consideration of the three-dimensional structure of the enzyme and the energetics of interaction between the enzyme and its various substrates, such that substrates that are productively recognized (i.e. cleaved) by the protease are lower in energy than those that are not.
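To make the idea of combining structure and energetics concrete, the following is a minimal sketch of a single graph convolution step over a toy residue interaction graph. It is an illustration under stated assumptions, not the PGCN architecture itself: the propagation rule is a standard normalized-adjacency update, the function names (`normalize_adjacency`, `matmul`, `graph_conv`) are hypothetical, and the edge weights are assumed to be non-negative interaction strengths (for example, magnitudes of pairwise energies) rather than raw Rosetta scores. All numbers are made up for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square, non-negative weight matrix."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Weighted degree of each node (always >= 1 because of the self-loop).
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def graph_conv(adj, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in propagated]


# Toy graph (hypothetical values): nodes 0 and 1 are enzyme residues,
# node 2 is a substrate residue. Edge weights stand in for the strength
# of pairwise interactions and are assumed to be non-negative.
toy_adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]

# Two illustrative node features per residue:
# [single-residue energy term, 1.0 if the residue belongs to the substrate].
toy_features = [
    [-1.2, 0.0],
    [-0.4, 0.0],
    [0.8, 1.0],
]

# A 2x2 weight matrix; in a trained model these values would be learned.
toy_weights = [
    [1.0, 0.5],
    [0.0, 1.0],
]

updated = graph_conv(toy_adj, toy_features, toy_weights)
```

After this step, each row of `updated` mixes a residue's own features with those of its interaction partners, weighted by edge strength. A classifier for cleaved versus uncleaved substrates would typically stack several such layers and pool the node representations into one graph-level prediction.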

