LIGHTWEIGHT EQUIVARIANT GRAPH REPRESENTATION LEARNING FOR PROTEIN ENGINEERING

Anonymous

Abstract

This work addresses directed evolution in computational protein design, where the goal is to accurately predict the function of a protein mutant. We design a lightweight zero-shot graph neural network for multi-task protein representation learning from 3D structure. Rather than reconstructing and optimizing the protein structure, the trained model recovers the amino acid types and key properties of the central residues from a given noisy three-dimensional local environment. On the prediction of higher-order mutations, where multiple amino acid sites of the protein are mutated simultaneously, the proposed strategy improves performance by 20% while requiring less than 1% of the computational resources used by popular transformer-based state-of-the-art deep learning models for protein design.

1. INTRODUCTION

Mutation is a biological process in which the amino acid (AA) type at one or more sites of a protein is changed. Because the functions of wild-type proteins do not always meet the demands of bioengineering, it is vital to manually optimize their functionality, namely fitness, with favorable mutations so that they can be applied in designing antibodies (Wu et al., 2019; Pinheiro et al., 2021; Shan et al., 2022) or enzymes (Sato & Ishida, 2019; Wittmann et al., 2021). A protein usually consists of hundreds to thousands of AAs, where each residue belongs to one of twenty AA types. To optimize a protein's functional fitness, a greedy search is usually conducted in the local sequence, where AA sites are mutated to suitable AA types to yield the protein mutant with the highest gain of function (Rocklin et al., 2017). Such a process is called directed evolution (Arnold, 1998). To obtain a mutant with high fitness, multiple AA sites (∼5-10) of the protein need to be mutated, which is referred to as deep mutation (see Figure 1). This, however, incurs enormous experimental costs, as the total number of potential combinations of mutations for deep mutants is astronomical. Since it is impossible to conduct systematic experimental tests on all possible deep mutations, in silico examination of protein variants' fitness becomes highly desirable.

A handful of deep learning methods have been developed to accelerate the discovery of advantageous mutants. For instance, Lu et al. (2022) applied 3DCNN to identify a new polymerase with an advantageous single-site mutation and enhanced the speed of degrading PET, i.e., a type of solid waste, by 7-8 times at 50 °C. Luo et al. (2021) proposed ECNET, which predicts functional fitness for protein engineering with evolutionary context. The model guides the engineering of TEM-1 β-lactamase and identifies variants with improved ampicillin resistance. Thean et al. (2022) enhanced SVD with deep learning to predict nuclease variants' activities in multi-site-saturated mutagenesis libraries and identified Cas9 nuclease variants that possess higher editing activity of derived base editors in human cells.

Due to the scarcity of labeled protein data, researchers often pre-train an encoder in an unsupervised manner on protein sequences or structures, and use the learned protein representations for specific downstream tasks, such as de novo protein design (Hsu et al., 2022), mutation effect prediction (Ingraham et al., 2019; Jing et al., 2020; Meier et al., 2021; Notin et al., 2022), and higher-level structure prediction (Elnaggar et al., 2021). In the context of fitness prediction of mutation effects, existing methods usually transform the problem into a mini de novo design task, which infers a specific AA type from its microenvironment, or analogously its neighboring AA types. Current state-of-the-art sequence-based protein learning methods rely heavily on multiple sequence alignment (MSA; Riesselman et al., 2018; Frazer et al., 2021; Rao et al., 2021) and protein language models (Elnaggar et al., 

