ALPHAFOLD DISTILLATION FOR IMPROVED INVERSE PROTEIN FOLDING

Anonymous

Abstract

Inverse protein folding, i.e., designing sequences that fold into a given three-dimensional structure, is one of the fundamental design challenges in bio-engineering and drug discovery. Traditionally, inverse folding mainly involves learning from sequences that have an experimentally resolved structure. However, the known structures cover only a tiny fraction of the protein sequence space, imposing limitations on model learning. Recently proposed forward folding models, e.g., AlphaFold, offer an unprecedented opportunity for accurate estimation of the structure given a protein sequence. Naturally, incorporating a forward folding model as a component of an inverse folding approach offers the potential to significantly improve inverse folding, as the folding model can provide feedback on any generated sequence in the form of the predicted protein structure or a structural confidence metric. However, at present, these forward folding models are still prohibitively slow to be part of the model optimization loop during training. In this work, we propose to perform knowledge distillation on the folding model's confidence metrics, e.g., pTM or pLDDT scores, to obtain a smaller, faster, and end-to-end differentiable distilled model, which can then be included in structure-consistency-regularized training of the inverse folding model. Moreover, our regularization technique is general and can be applied to other design tasks, e.g., sequence-based protein infilling. Extensive experiments show a clear benefit of our method over non-regularized baselines. For example, in inverse folding design problems we observe up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity, while still preserving structural consistency of the generated sequences.
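The two-stage idea in the abstract, first distill a slow folding model's confidence score into a fast differentiable student, then use that student as a regularizer on the design loss, can be illustrated with a minimal numpy sketch. This is not the paper's architecture: `teacher_plddt`, the linear-sigmoid student, and `design_loss` are all illustrative stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_plddt(seq_feat):
    # Stand-in for an expensive folding model's confidence output
    # (e.g., a pLDDT-like score in [0, 1]). Illustrative only.
    return 1.0 / (1.0 + np.exp(-seq_feat @ np.array([0.5, -0.3, 0.8])))

# --- Step 1: distillation ---
# Regress a tiny sigmoid "student" onto the teacher's confidence scores
# via gradient descent on a mean-squared-error objective.
X = rng.normal(size=(256, 3))   # toy per-sequence features
y = teacher_plddt(X)            # teacher confidence labels
w = np.zeros(3)
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ ((pred - y) * pred * (1 - pred)) / len(X)
    w -= 1.0 * grad

def student(feat):
    # Fast, differentiable surrogate for the teacher's confidence.
    return 1.0 / (1.0 + np.exp(-feat @ w))

# --- Step 2: structure-consistency regularization ---
# The student's predicted confidence is added (as a reward) to the
# inverse folding task loss; lam trades off the two terms.
def design_loss(seq_feat, recon_loss, lam=0.1):
    return recon_loss - lam * student(seq_feat)
```

Because the student is cheap and differentiable, its gradient with respect to the sequence features can flow into the inverse folding model during training, which is exactly what the full-size forward folding model is too slow to provide.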

1. INTRODUCTION

To date, 8 out of 10 top-selling drugs are engineered proteins (Arnum, 2022). For functional protein design, it is often a prerequisite that the designed protein folds into a specific three-dimensional structure. The fundamental task of designing novel amino acid sequences that will fold into a given 3D protein structure is known as inverse protein folding. Inverse protein folding is therefore a central challenge in bio-engineering and drug discovery. Computationally, inverse protein folding can be formulated as exploring the protein sequence landscape for a given protein backbone to find a combination of amino acids that supports a desired property (e.g., structural consistency). This task, computational protein design, has traditionally been handled by learning to optimize amino acid sequences against a physics-based scoring function (Kuhlman et al., 2003). In recent years, deep generative models have been proposed to solve this task by learning a mapping from protein structure to sequences (Jing et al., 2020; Cao et al., 2021; Wu et al., 2021; Karimi et al., 2020; Hsu et al., 2022; Fu & Sun, 2022). These approaches frequently use high amino acid recovery with respect to the ground truth sequence (corresponding to the input structure) as one success criterion. Other success criteria are high TM-score (reflecting structural consistency) and low perplexity (measuring likelihood under the training/natural sequence distribution). However, such criteria ignore the practical purpose of inverse protein folding: to design novel and diverse sequences that fold into the desired structure and thus exhibit novel functions. In parallel to machine learning advances in inverse folding, notable progress has been made recently in protein representation learning (Rives et al., 2021; Zhang et al., 2022), protein structure
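Of the success criteria listed above, amino acid (sequence) recovery is the simplest: the fraction of designed residues that match the ground-truth sequence for the input structure. A minimal sketch (the function name is ours, not from the paper):

```python
def sequence_recovery(designed: str, reference: str) -> float:
    """Fraction of positions where the designed sequence matches
    the ground-truth sequence of the target structure."""
    if len(designed) != len(reference):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(d == r for d, r in zip(designed, reference))
    return matches / len(reference)

# Example: 3 of 4 residues recovered.
sequence_recovery("MKTA", "MKSA")  # -> 0.75
```

As the paragraph notes, a high recovery score alone says nothing about diversity: a model that copies the reference sequence scores 1.0 while producing nothing novel, which is why recovery is reported alongside structural consistency and diversity metrics.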

