PIFOLD: TOWARD EFFECTIVE AND EFFICIENT PROTEIN INVERSE FOLDING

Abstract

How can we design protein sequences that fold into desired structures both effectively and efficiently? AI methods for structure-based protein design have attracted increasing attention in recent years; however, few methods improve accuracy and efficiency simultaneously, due to the lack of expressive residue features and the reliance on autoregressive sequence decoders. To address these issues, we propose PiFold, which couples a novel residue featurizer with PiGNN layers to generate protein sequences in a one-shot manner with improved recovery. Experiments show that PiFold achieves 51.66% recovery on CATH 4.2 while running inference 70 times faster than autoregressive competitors. In addition, PiFold achieves 58.72% and 60.42% recovery on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the roles of different protein features and model designs, inspiring further simplification and improvement. The PyTorch code is available on GitHub.

1. INTRODUCTION

Proteins are linear chains of amino acids that fold into 3D structures to control cellular processes such as transcription, translation, signaling, and cell-cycle control. Creating novel proteins for human purposes could deepen our understanding of living systems and facilitate the fight against disease. One of the crucial problems is to design protein sequences that fold into desired structures, namely structure-based protein design (Pabo, 1983). Recently, many deep learning models have been proposed to solve this problem (Li et al., 2014; Wu et al., 2021; Pearce & Zhang, 2021; Ovchinnikov & Huang, 2021; Ding et al., 2022; Gao et al., 2020; 2022a; Dauparas et al., 2022; Ingraham et al., 2019; Jing et al., 2020; Tan et al., 2022c; Hsu et al., 2022; O'Connell et al., 2018; Wang et al., 2018; Qi & Zhang, 2020; Strokach et al., 2020; Chen et al., 2019; Zhang et al., 2020a; Huang et al., 2017; Anand et al., 2022; Strokach & Kim, 2022; Li & Koehl, 2014; Greener et al., 2018; Karimi et al., 2020; Anishchenko et al., 2021; Cao et al., 2021; Liu et al., 2022; McPartlon et al., 2022; Huang et al., 2022; Dumortier et al., 2022; Li et al., 2022a; Maguire et al., 2021; Li et al., 2022b), among which graph-based models have made significant progress. However, there is still room to improve accuracy and efficiency. For example, most graph models cannot achieve 50+% sequence recovery on the CATH dataset due to the lack of expressive residue representations. Moreover, most graph models (Dauparas et al., 2022; Ingraham et al., 2019; Jing et al., 2020; Tan et al., 2022c; Hsu et al., 2022) adopt an autoregressive decoding scheme to generate amino acids, dramatically slowing down inference. We aim to improve accuracy and efficiency simultaneously with a simple model containing as few redundancies as possible.
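The cost asymmetry between the two decoding schemes can be sketched in a toy form (all names here are illustrative stand-ins, not PiFold's actual decoder): an autoregressive decoder must call the model once per residue, each call conditioned on the already-generated prefix, while a one-shot decoder predicts every position in a single parallel pass.

```python
def dummy_model(context):
    """Stand-in for a trained decoder (illustrative only): always predicts alanine."""
    return "A"

def autoregressive_decode(model, length):
    """Generate residues left to right; each call conditions on the prefix,
    so `length` sequential model calls are required."""
    seq, calls = [], 0
    for _ in range(length):
        seq.append(model(seq))  # one model call per residue
        calls += 1
    return "".join(seq), calls

def one_shot_decode(model, length):
    """Predict all residues at once; conceptually a single parallel forward pass."""
    seq = [model(i) for i in range(length)]  # one batched call in a real model
    return "".join(seq), 1
```

For a protein of length L, the autoregressive scheme performs L sequential calls while the one-shot scheme performs one, which is the source of the inference-speed gap discussed above.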
Interestingly, few studies have attempted to improve model efficiency, perhaps because efficiency gains typically require sacrificing some accuracy (Bahdanau et al., 2015; Vaswani et al., 2017; Ghazvininejad et al., 2019; Geng et al., 2021; Xie et al., 2020; Wang et al., 2019; Gu et al., 2018), and accuracy matters more than efficiency in protein design. To address this dilemma, AlphaDesign (Gao et al., 2022a) proposes a parallel self-correcting module that speeds up inference while almost maintaining recovery. Nevertheless, it still degrades performance somewhat and requires two iterations per prediction. Can we generate protein sequences in a one-shot way without loss of accuracy?

We propose PiFold (protein inverse folding) to address the problems above; it consists of a novel residue featurizer and stacked PiGNN layers. In the featurizer, for each residue we construct more comprehensive features and introduce learnable virtual atoms to capture information overlooked by real atoms. The PiGNN models feature dependencies at the node, edge, and global levels to learn from multi-scale residue interactions. In addition, we can completely remove the autoregressive decoder by stacking more PiGNN layers without sacrificing accuracy. Experiments show that PiFold achieves state-of-the-art recoveries on several real-world datasets, i.e., 51.66% on CATH 4.2, 58.72% on TS50, and 60.42% on TS500. PiFold is the first graph model to exceed 55% recovery on TS50 and 60% recovery on TS500. In addition, PiFold's inference is 70+ times faster than autoregressive competitors when designing long proteins. More importantly, we conduct extensive ablation studies to reveal the role of each module, which deepens the reader's understanding of PiFold and may inspire subsequent research. In summary, our contributions include:



† Equal Contribution, * Corresponding Author.



Figure 1: Performance comparison with other graph-based protein design methods. Recovery scores, inference time costs, and perplexities are shown along the Y-axis, along the X-axis, and as circle sizes, respectively. Note that recovery and perplexity on the CATH dataset are reported without using any other training data. Inference time is averaged over 100 long protein sequences of average length 1632 on an NVIDIA V100.
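For context, the recovery score reported in the figure (and throughout) is simply the fraction of designed positions whose amino acid matches the native sequence; a minimal sketch follows (the function name is our own, not from the paper):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Toy example: 3 of 4 residues recovered.
print(sequence_recovery("MKVA", "MKLA"))  # 0.75
```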

In recent years, graph-based models have striven to learn expressive residue representations through better feature engineering, more elaborate models, and larger training datasets. For example, AlphaDesign (Gao et al., 2022a) and ProteinMPNN (Dauparas et al., 2022) point out that additional angle and distance features can significantly improve representation quality. GraphTrans (Ingraham et al., 2019)
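As an illustration of such angle and distance features, the sketch below computes a pairwise atom distance and a backbone dihedral angle from 3D coordinates using standard geometry. The exact feature sets used by AlphaDesign, ProteinMPNN, or PiFold differ in detail; this is generic geometry, not any model's featurizer.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3D points, e.g. a CA-CA distance feature."""
    return math.dist(a, b)

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (radians) defined by four consecutive atoms,
    e.g. backbone phi/psi/omega angles."""
    def sub(u, v): return [a - b for a, b in zip(u, v)]
    def dot(u, v): return sum(a * b for a, b in zip(u, v))
    def cross(u, v):
        return [u[1]*v[2] - u[2]*v[1],
                u[2]*v[0] - u[0]*v[2],
                u[0]*v[1] - u[1]*v[0]]
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)       # normals of the two planes
    b1n = math.sqrt(dot(b1, b1))
    m1 = cross(n1, [x / b1n for x in b1])        # frame vector orthogonal to n1, b1
    return math.atan2(dot(m1, n2), dot(n1, n2))
```

A planar zig-zag of four atoms gives a dihedral of pi (trans), while folding the fourth atom back over the first gives 0 (cis), matching the usual convention.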


1. We propose a novel residue featurizer that constructs comprehensive residue features and learns virtual atoms to capture information complementary to real atoms.
2. We propose the PiGNN layer to learn representations from multi-scale residue interactions.
3. We suggest removing the autoregressive decoder to speed up inference without sacrificing accuracy.
4. We comprehensively compare advanced graph models on real-world datasets, e.g., CATH, TS50, and TS500, and demonstrate the potential of designing different protein chains.
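To make the virtual-atom idea in contribution 1 concrete, one common way to realize it is to place each virtual atom at an offset expressed in a per-residue local frame built from the backbone N, CA, and C atoms. The sketch below is purely illustrative (a Gram-Schmidt frame with a fixed rather than learned offset) and is not PiFold's implementation; in practice the offset would be a learnable parameter.

```python
import math

def _sub(u, v): return [a - b for a, b in zip(u, v)]
def _dot(u, v): return sum(a * b for a, b in zip(u, v))
def _scale(u, s): return [a * s for a in u]
def _unit(u): return _scale(u, 1.0 / math.sqrt(_dot(u, u)))
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0]]

def local_frame(n, ca, c):
    """Orthonormal residue frame from backbone N, CA, C (Gram-Schmidt)."""
    e1 = _unit(_sub(c, ca))                       # CA->C direction
    u = _sub(n, ca)
    e2 = _unit(_sub(u, _scale(e1, _dot(u, e1))))  # CA->N, orthogonalized
    e3 = _cross(e1, e2)                           # completes right-handed frame
    return [e1, e2, e3]

def virtual_atom(n, ca, c, offset):
    """Place a virtual atom at `offset` (a 3-vector, learnable in a real model)
    expressed in the residue's local frame, anchored at CA."""
    basis = local_frame(n, ca, c)
    return [ca[i] + sum(offset[k] * basis[k][i] for k in range(3))
            for i in range(3)]
```

Because the offset lives in the residue frame, the virtual atom moves rigidly with the backbone, so features derived from it stay invariant to global rotations and translations of the structure.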

