PREDICTION OF ENZYME SPECIFICITY USING PROTEIN GRAPH CONVOLUTIONAL NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Specific molecular recognition by proteins, for example protease enzymes, is critical for maintaining the robustness of key life processes. The substrate specificity landscape of a protease enzyme comprises the set of all sequence motifs that are recognized/cut or, just as importantly, not recognized/cut by the enzyme. Current methods for predicting protease specificity landscapes rely on learning sequence patterns in experimentally derived data for a single enzyme, and are not robust to even small mutational changes. A comprehensive evaluation of specificity requires consideration of the three-dimensional structure and energetics of molecular interactions. In this work, we present a protein graph convolutional neural network (PGCN) that determines substrate specificity from a physically intuitive, structure-based molecular interaction graph, generated using the Rosetta energy function, that describes both topology and energetic features. We use the PGCN to recapitulate and predict the specificity of the NS3/4 protease from the Hepatitis C virus. We compare our PGCN with previously used machine learning models and show that its performance in classification tasks is equivalent or better. Because PGCN is based on physical interactions, it is inherently more interpretable; determination of feature importance reveals key sub-graph patterns responsible for molecular recognition that are biochemically reasonable. The PGCN model also readily lends itself to the design of novel enzymes with tailored specificity against disease targets.

1. INTRODUCTION

Selective molecular recognition between biomolecules, e.g. protein-protein, DNA-protein (Tainer & Cunningham, 1993), and protein-small molecule interactions, is key for maintaining the fidelity of life processes. Multispecificity, i.e. the specific recognition and non-recognition of multiple targets by biomolecules, is critical for many biological processes. For example, the selective recognition and cleavage of host and viral target sites by viral protease enzymes is critical for the life cycle of many RNA viruses, including SARS-CoV-2 (Vizovišek et al., 2018). Prospective prediction of the sequence motifs corresponding to protease enzyme target sites (substrates) is therefore an important goal with broad implications. Elucidating the target specificity of viral protease enzymes can guide the design of inhibitors as antiviral drug candidates. The ability to accurately and efficiently model the landscape of protease specificity, i.e. the set of all substrate sequence motifs that are recognized (and not recognized) by a given enzyme and its variants, would also enable the design of proteases with tailored specificities to degrade chosen disease-related targets. Most current approaches for protease specificity prediction involve detecting and/or learning patterns in known substrate sequences using techniques ranging from logistic regression to deep learning. However, these black-box approaches do not provide any physical/chemical insight into the underlying basis for a particular specificity profile, nor are they robust to changes in the protease enzyme that often arise in the course of evolution. A comprehensive model of protease specificity requires consideration of the three-dimensional structure of the enzyme and the energetics of interaction between the enzyme and its various substrates, such that substrates that are productively recognized (i.e. cleaved) by the protease are lower in energy than those that are not.
To encode the topology and energetic features, here we develop Protein Graph Convolutional Neural Networks (PGCN). PGCN uses experimentally derived data and a physically intuitive, structure-based molecular interaction energy graph to solve the classification problem for substrate specificity. Protease and substrate residues are considered as nodes, and energies of interactions are obtained from a pairwise decomposition of the energy of the complex calculated using the Rosetta energy function. These energies are assigned as (multiple) node and edge features. We find that PGCN is as good as or better than other previously used machine learning models for protease specificity prediction. In addition, it is more interpretable and highlights critical sub-graph patterns responsible for observed specificity patterns. As it is based on physical interactions, the PGCN model is capable of both prospective prediction of the specificity of chosen protease enzymes and the generation of novel designed enzymes with tailored specificity against chosen targets.
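As a rough illustration of the graph construction described above, the following minimal sketch builds a residue-level graph from per-residue and pairwise energies and applies one generic graph convolution step. It is not the PGCN architecture itself: the function names, the use of absolute pairwise energy as an edge weight, the cutoff, the symmetric normalization, and all numeric values are assumptions made for illustration, and no Rosetta calls are shown (the energies are taken as given arrays).

```python
import numpy as np

def build_energy_graph(one_body, two_body, cutoff=0.0):
    """Return node features and a weighted adjacency matrix.

    one_body: (N, F) per-residue energy terms (placeholder values here).
    two_body: (N, N) symmetric pairwise interaction energies.
    Assumption: an edge exists where |energy| > cutoff, weighted by |energy|.
    """
    x = np.asarray(one_body, dtype=float)
    e = np.abs(np.asarray(two_body, dtype=float))
    adj = np.where(e > cutoff, e, 0.0)
    np.fill_diagonal(adj, 0.0)  # no self-interaction edges
    return x, adj

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(x, a_norm, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    return np.maximum(a_norm @ x @ weight, 0.0)

# Toy complex with three residues and two energy terms per residue.
# All numbers are made up; they are not Rosetta output.
one_body = [[-1.0, 0.5], [0.2, -0.3], [0.0, 1.0]]
two_body = [[0.0, -2.0, 0.0], [-2.0, 0.0, 0.5], [0.0, 0.5, 0.0]]

x, adj = build_energy_graph(one_body, two_body)
a_norm = normalize_adjacency(adj)
rng = np.random.default_rng(0)
hidden = gcn_layer(x, a_norm, rng.normal(size=(2, 4)))  # shape (3, 4)
```

In this sketch the pairwise energies only set edge weights; a model that carries multiple edge features per residue pair, as the text describes, would need a richer edge representation than a single adjacency matrix.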

2. RELATED WORK

In this work, we develop a graph-based deep learning technique for protease enzyme specificity prediction. Here we briefly review previously developed methods for predicting protease specificity landscapes and applications of graph-based convolutional neural networks to protein-related problems.

2.1. PREDICTION OF PROTEASE SPECIFICITY LANDSCAPE

Current methods to discriminate the specificity landscape of one or more types of protease enzymes can be classified into two categories: machine learning approaches and scoring-matrix-based approaches (Li et al., 2019). Machine learning approaches use methods such as logistic regression, random forest, decision tree, and support vector machine (SVM) to predict substrate specificity. Among them, SVM is the most popular, e.g. PROSPER (Song et al., 2012), iProt-Sub (Song et al., 2018), CASVM (Wee et al., 2007), and Cascleave (Song et al., 2010). In addition, NeuroPred (Southey et al., 2006) and PROSPERous (Song et al., 2017) applied logistic regression to predict neuropeptide specificity (NeuroPred) and the specificity of 90 different proteases (PROSPERous). Pripper (Piippo et al., 2010) provided three different classifiers based on SVM, decision tree, and random forest. Procleave (Li et al., 2020b) implemented a probabilistic model trained with both sequence and structure feature information. DeepCleave (Li et al., 2020a) is a tool that predicts substrate specificity using a convolutional neural network (CNN). None of the methods mentioned above use energy-related features; however, we note that some energy terms for the interface were considered in Pethe et al. (2017) and Pethe et al. (2018).

2.2. GRAPH CONVOLUTIONAL NEURAL NETWORK ON PROTEIN-RELATED PROBLEMS

Several works propose or implement graph-based convolutional neural network models to solve various protein modeling-related problems. BIPSPI (Sanchez-Garcia et al., 2019) made use of both hand-crafted sequence and structure features to predict residue-residue contacts in proteins. Gligorijevic et al. (2020) proposed a novel model that generates node features by using an LSTM to learn genetic information, and an edge adjacency matrix from contact maps, to classify different protein functions. Graph convolutional neural networks have also been applied to drug discovery, for either node classification or energy score prediction (Sun et al., 2019). Fout et al. (2017) proposed a graph-based model that encodes a protein and a drug each into a graph, considers local neighborhood information from each node, and learns multiple edge features for edges between neighbor residues and nodes. Zamora-Resendiz & Crivelli (2019) designed their model to learn sequence- and structure-based information more efficiently than 2D/3D CNNs for protein structure classification. Moreover, Cao & Shen (2019) and Sanyal et al. (2020) aimed at improving the energy functions used for protein model evaluation by using molecular graphs. Unlike previous work, our approach uses per-residue and residue-residue pairwise energies as features for predicting molecular function.

