LIGHTWEIGHT EQUIVARIANT GRAPH REPRESENTATION LEARNING FOR PROTEIN ENGINEERING

Anonymous

Abstract

This work addresses directed evolution in computational protein design: accurately predicting the function of a protein mutant. We design a lightweight zero-shot graph neural network model for multi-task protein representation learning from 3D structure. Rather than reconstructing and optimizing the full protein structure, the trained model recovers the amino acid type and key properties of a central residue from its noisy three-dimensional local environment. On higher-order mutations, where multiple amino acid sites of the protein are mutated simultaneously, the proposed strategy achieves remarkably higher performance, a 20% improvement, while requiring less than 1% of the computational resources needed by popular transformer-based state-of-the-art deep learning models for protein design.

1. INTRODUCTION

Mutation is a biological process in which the amino acid (AA) type at one or more sites of a protein is changed. Because the functions of wild-type proteins do not always meet the demands of bio-engineering, it is vital to optimize their functionality, namely fitness, with favorable mutations so that they are applicable to designing antibodies (Wu et al., 2019; Pinheiro et al., 2021; Shan et al., 2022) or enzymes (Sato & Ishida, 2019; Wittmann et al., 2021). A protein usually consists of hundreds to thousands of AAs, where each residue belongs to one of twenty AA types. To optimize a protein's functional fitness, a greedy search is usually conducted over the local sequence space, mutating AA sites to the AA types that render the mutant with the highest gain-of-function (Rocklin et al., 2017). Such a process is called directed evolution (Arnold, 1998). Obtaining a mutant with high fitness typically requires mutating multiple AA sites (∼5-10) of the protein, namely deep mutations (see Figure 1). This, however, incurs enormous experimental costs, as the total number of potential combinations of mutations for deep mutants is astronomical. Since it is impossible to conduct systematic experimental tests on all possible deep mutations, in silico examination of protein variants' fitness becomes highly desirable. A handful of deep learning methods have been developed to accelerate the discovery of advantageous mutants. For instance, Lu et al. (2022) applied a 3D CNN to identify a new polymerase with an advantageous single-site mutation, enhancing the speed of degrading PET, a type of solid waste, by 7-8 times at 50 °C. Luo et al. (2021) proposed ECNET, which predicts functional fitness for protein engineering from evolutionary context; the model guided the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance. Thean et al. (2022) enhanced SVD with deep learning to predict nuclease variants' activities in multi-site-saturated mutagenesis libraries and identified Cas9 nuclease variants whose derived base editors possess higher editing activity in human cells.

Due to the scarcity of labeled protein data, researchers often pre-train an encoder in an unsupervised fashion on protein sequences or structures, and then use the learned protein representations for specific tasks, such as de novo protein design (Hsu et al., 2022), mutation effect prediction (Ingraham et al., 2019; Jing et al., 2020; Meier et al., 2021; Notin et al., 2022), and higher-level structure prediction (Elnaggar et al., 2021). In the context of fitness prediction of mutation effects, existing methods usually transform the problem into mini de novo design, which infers a specific AA type from its microenvironment, or analogously from its neighboring AA types. Current state-of-the-art sequence-based protein learning methods rely heavily on multiple sequence alignment (MSA; Riesselman et al., 2018; Rives et al., 2021; Nijkamp et al., 2022; Brandes et al., 2022). While MSA helps capture important evolutionary properties of the protein family, it nevertheless multiplies the required computing resources. Protein language models derived from natural language processing (NLP) encode sequence semantics and often need hundreds of GPU cards to train on hundreds of millions of proteins. Meanwhile, an autoregressive inference pass over the entire protein sequence is usually required to score a mutation at a single site, which further slows down inference (Sato & Ishida, 2019; Liu et al., 2022; Hsu et al., 2022; Notin et al., 2022). More importantly, when predicting the fitness of higher-order mutants, most of these models make the crude assumption that mutations at different sites happen sequentially or independently, which is incorrect in most cases (Lehner, 2011; Breen et al., 2012).
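The additive scoring convention criticized above can be sketched as follows. This is an illustrative sketch, not any specific model's implementation; `site_probs` stands in for a model's predicted per-site distributions over the 20 AA types:

```python
import numpy as np

# Hypothetical predicted distributions: one row per site, 20 AA types.
rng = np.random.default_rng(0)
site_probs = rng.dirichlet(np.ones(20), size=4)  # 4 sites, each row sums to 1

def single_site_score(probs, site, wt, mut):
    """Log-odds ratio log P(mut) / P(wt) for a mutation at one site."""
    return np.log(probs[site, mut]) - np.log(probs[site, wt])

def additive_score(probs, mutations):
    """Common (but crude) scoring of a higher-order mutant: sum independent
    single-site log-odds ratios, which ignores epistasis between sites."""
    return sum(single_site_score(probs, s, wt, mut) for s, wt, mut in mutations)

# Score a double mutant: site 0 type 3 -> 7, and site 2 type 1 -> 5.
score = additive_score(site_probs, [(0, 3, 7), (2, 1, 5)])
```

By construction, the double-mutant score is exactly the sum of the two single-site scores, regardless of any interaction between the sites; this is the independence assumption the text objects to.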
The neglected epistatic effects between different sites are potentially a key factor hindering the acquisition of favorable high-order mutants in directed evolution (Sarkisyan et al., 2016; Rollins et al., 2019). Mutation of AA sites also occurs in nature, where an AA site may be mutated to any of the other 19 AA types at random; natural selection then ensures that only the mutants that exhibit the best fitness and fit the environment survive. As a protein's functionality is determined by its structure, we encode the folded protein as a protein graph with AAs as nodes, providing an elegant 3D spatial description of the protein. First-level information, such as AA types, spatial coordinates of Cα atoms, and C-N angles between neighboring AAs, is embedded in the node features. Altering the AA types of a protein in nature can then be viewed as corrupting the node features of the protein graph, and denoising the graph provides a remedy for searching for mutants with the best fitness. We model protein mutation effect prediction as a denoising problem with equivariant graph neural networks (Satorras et al., 2021). For a given protein, the recovered predictions can be leveraged to forecast the fitness of deep mutational effects and discover favorable mutants. Compared to existing state-of-the-art deep learning methods for mutation effect prediction, such as ESM-1V (Meier et al., 2021) and ESM-IF1 (Hsu et al., 2022), the designed lightweight equivariant graph neural network (LGN) stands out in three respects. First, LGN improves generalization ability through a multi-task learning strategy and biological prior knowledge: the pre-trained model encodes the chemical and physical properties of a given AA's microenvironment with domain knowledge, yielding practically meaningful representations.
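As a concrete illustration of the graph construction described above, the sketch below builds k-nearest-neighbor edges over Cα coordinates. The neighbor count `k` and the plain edge-list representation are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def knn_protein_graph(ca_coords, k=10):
    """Connect each residue to its k nearest residues by C-alpha distance.

    ca_coords: (N, 3) array of C-alpha coordinates.
    Returns a list of directed edges (i, j), j being a neighbor of i.
    """
    # Pairwise Euclidean distances between all C-alpha atoms.
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-loops
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(ca_coords)) for j in neighbors[i]]

rng = np.random.default_rng(1)
coords = rng.normal(size=(30, 3)) * 10.0  # toy C-alpha positions
edges = knn_protein_graph(coords, k=5)    # 30 * 5 directed edges
```

Node features (AA type, coordinates, backbone angles) would then be attached to each node, and corrupting those features plays the role of mutation in the denoising setup.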
Secondly, LGN avoids the independent-mutation assumption by generating the probabilities of all the amino acid residues at once, which implements the joint distribution of all variations. In the literature, the higher-order mutation effect is usually approximated by summing the log-odds-ratio scores of the corresponding individual single-site mutants. This linear combination of separately assigned predictions is unsubstantiated, as assuming independent mutations neglects the epistatic effect. Thirdly, LGN is efficient in both the training and inference phases. The spatial graph inputs capture the topological properties of proteins, which circumvents the data augmentation typically required by sequence or grid representations. Equivariant message passing, in turn, provides a feature distillation unit with translation and rotation equivariance and encodes each AA's microenvironment as defined by the protein graph's geometry.
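The equivariant message passing mentioned above follows the E(n)-equivariant GNN of Satorras et al. (2021); a minimal NumPy sketch of one layer is shown below. The tiny tanh MLPs and random weights are placeholders for illustration, not the paper's architecture. Because messages depend only on rotation-invariant quantities (node features and squared distances) and the coordinate update is a weighted sum of relative positions, rotating the input coordinates rotates the output coordinates identically:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(w, b, x):
    """One-layer tanh MLP; a stand-in for the learnable phi networks."""
    return np.tanh(x @ w + b)

class EGNNLayer:
    """Minimal E(n)-equivariant layer in the spirit of Satorras et al. (2021)."""
    def __init__(self, dim, hidden=16):
        self.we = rng.normal(0.0, 0.1, (2 * dim + 1, hidden))  # edge MLP
        self.be = np.zeros(hidden)
        self.wx = rng.normal(0.0, 0.1, (hidden, 1))            # coord weight
        self.wh = rng.normal(0.0, 0.1, (dim + hidden, dim))    # node update
        self.bh = np.zeros(dim)

    def __call__(self, h, x, edges):
        msg = np.zeros((len(h), self.be.size))
        dx = np.zeros_like(x)
        for i, j in edges:
            d2 = np.sum((x[i] - x[j]) ** 2)  # rotation/translation invariant
            m = mlp(self.we, self.be, np.concatenate([h[i], h[j], [d2]]))
            msg[i] += m
            # Coordinate update along relative positions: equivariant.
            dx[i] += (x[i] - x[j]) * (m @ self.wx).item()
        return mlp(self.wh, self.bh, np.concatenate([h, msg], axis=1)), x + dx

# Equivariance check: rotating the inputs rotates the coordinate outputs
# while leaving the (invariant) node features unchanged.
layer = EGNNLayer(dim=4)
h = rng.normal(size=(6, 4))
x = rng.normal(size=(6, 3))
edges = [(i, j) for i in range(6) for j in range(6) if i != j]
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
h1, x1 = layer(h, x, edges)
h2, x2 = layer(h, x @ R.T, edges)
```

This built-in symmetry is what lets the graph encoder skip the rotation-based data augmentation that grid (voxel) representations typically need.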



Figure 1: Mutating more sites frequently results in a higher score, i.e., a smaller rank value.

