ALPHAFOLD DISTILLATION FOR IMPROVED INVERSE PROTEIN FOLDING

Anonymous

Abstract

Inverse protein folding, i.e., designing sequences that fold into a given three-dimensional structure, is one of the fundamental design challenges in bio-engineering and drug discovery. Traditionally, inverse folding mainly involves learning from sequences that have an experimentally resolved structure. However, the known structures cover only a tiny fraction of the protein sequence space, imposing limitations on model learning. Recently proposed forward folding models, e.g., AlphaFold, offer an unprecedented opportunity for accurate estimation of the structure given a protein sequence. Naturally, incorporating a forward folding model as a component of an inverse folding approach has the potential to significantly improve inverse folding, as the folding model can provide feedback on any generated sequence in the form of a predicted protein structure or a structural confidence metric. However, at present, these forward folding models are still prohibitively slow to be part of the model optimization loop during training. In this work, we propose to perform knowledge distillation on the folding model's confidence metrics, e.g., pTM or pLDDT scores, to obtain a smaller, faster, and end-to-end differentiable distilled model, which can then be included as part of structure-consistency-regularized inverse folding model training. Moreover, our regularization technique is general and can be applied to other design tasks, e.g., sequence-based protein infilling. Extensive experiments show a clear benefit of our method over non-regularized baselines. For example, in inverse folding design problems we observe up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity, while still preserving the structural consistency of the generated sequences.

1. INTRODUCTION

To date, 8 out of 10 top-selling drugs are engineered proteins (Arnum, 2022). For functional protein design, it is often a prerequisite that the designed protein folds into a specific three-dimensional structure. The fundamental task of designing novel amino acid sequences that will fold into a given 3D protein structure is called inverse protein folding, and it is a central challenge in bio-engineering and drug discovery. Computationally, inverse protein folding can be formulated as exploring the protein sequence landscape for a given protein backbone to find a combination of amino acids that supports a desired property (e.g., structural consistency). This task, computational protein design, has traditionally been handled by learning to optimize amino acid sequences against a physics-based scoring function (Kuhlman et al., 2003). In recent years, deep generative models that learn a mapping from protein structure to sequence have been proposed to solve this task (Jing et al., 2020; Cao et al., 2021; Wu et al., 2021; Karimi et al., 2020; Hsu et al., 2022; Fu & Sun, 2022). These approaches frequently use high amino acid recovery with respect to the ground-truth sequence (the one corresponding to the input structure) as one success criterion. Other success criteria are a high TM-score (reflecting structural consistency) and low perplexity (measuring likelihood under the training/natural sequence distribution). However, such criteria ignore the practical purpose of inverse protein folding: to design novel and diverse sequences that fold into the desired structure and thus may exhibit novel functions. In parallel to machine learning advances in inverse folding, notable progress has recently been made in protein representation learning (Rives et al., 2021; Zhang et al., 2022), protein structure prediction from sequences (Jumper et al., 2021; Baek et al., 2021b), as well as in conditional protein sequence generation (Das et al., 2021; Anishchenko et al., 2021).

These lines of work have largely benefited from learning on millions of available protein sequences (which may or may not have a resolved structure) in a self-/un-supervised pre-training paradigm. Such large-scale pre-training has immensely improved the information content and task performance of the learned models. For example, it has been observed that structural and functional aspects emerge from a representation learned on broad protein sequence data (Rives et al., 2021). In contrast, inverse protein folding has mainly focused on learning from sequences that do have an experimentally resolved structure. The reported structures cover less than 0.1% of the known space of protein sequences, limiting the learning of inverse folding models. In this direction, a recent work trained an inverse folding model from scratch on millions of AlphaFold-predicted protein structures (in addition to tens of thousands of experimentally resolved structures) and showed performance improvement in terms of amino acid recovery (Hsu et al., 2022). However, such large-scale training from scratch is computationally expensive. A more efficient alternative is to use the guidance of an already available forward folding model, pre-trained on large-scale data, when training the inverse folding model.

In this work we construct a framework where the inverse folding model is trained using a loss objective that consists of the regular sequence reconstruction loss, augmented with an additional structure consistency (SC) loss (see Fig. 1 for the system overview). The straightforward way of implementing this would be to use a forward protein folding model, e.g., AlphaFold, to estimate the structure of a generated sequence, compare it with the ground truth, and compute the TM-score to regularize the training. However, a challenge in using AlphaFold (or a similar model) directly is the computational cost associated with its inference (see Fig. 2), as well as the need for a ground-truth reference structure. The forward folding model's internal confidence metrics can be used instead, but that approach is still too slow for in-the-loop inverse folding model optimization.
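To make the framework concrete before the formal definitions, below is a minimal PyTorch-style sketch of such an SC-regularized objective. The function name, the lambda_sc weighting, and the softmax relaxation of the generated sequence are our illustrative assumptions, not the exact formulation introduced later in the paper; afdistill stands in for the frozen distilled scorer described next.

import torch.nn.functional as F

def sc_regularized_loss(logits, target_seq, afdistill, lambda_sc=0.5):
    # logits: (B, L, V) amino acid logits from the inverse folding model.
    # target_seq: (B, L) ground-truth residue indices.
    # afdistill: frozen, end-to-end differentiable scorer returning a
    # pTM/pLDDT-like confidence in [0, 1] (assumed interface).
    ce = F.cross_entropy(logits.transpose(1, 2), target_seq)
    # A soft (relaxed) sequence keeps the whole loss differentiable,
    # avoiding a non-differentiable argmax over amino acids.
    soft_seq = F.softmax(logits, dim=-1)
    sc = 1.0 - afdistill(soft_seq).mean()  # high confidence -> low penalty
    return ce + lambda_sc * sc

Because afdistill replaces a full forward-folding call, an objective of this form needs neither AlphaFold inference nor a ground-truth reference structure at training time.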
To address this, in our work we: (i) perform knowledge distillation on AlphaFold and include the resulting model, AFDistill (kept frozen), as part of the regularized training of the inverse folding model (we term this regularization the structure consistency (SC) loss); the main properties of the AFDistill model are that it is fast, accurate, and end-to-end differentiable; and (ii) perform extensive evaluations, where results on standard structure-guided sequence design benchmarks show that our proposed system outperforms existing baselines in terms of lower perplexity and higher amino acid recovery, while maintaining closeness to the original protein structure.
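As a rough illustration of the distillation itself, the sketch below pairs a toy AFDistill-style student with a single training step that regresses AlphaFold's confidence scores. The architecture, its sizes, and the distill_step helper are hypothetical placeholders; the actual model and distillation data are described in the following sections.

import torch.nn as nn

class AFDistill(nn.Module):
    # Toy student: a small Transformer encoder mapping a (soft) amino acid
    # sequence to a scalar confidence in [0, 1]. All sizes are placeholders.
    def __init__(self, vocab_size=21, dim=128, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(vocab_size, dim)  # accepts soft one-hots
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, seq_probs):  # (B, L, vocab_size)
        h = self.encoder(self.embed(seq_probs))
        return self.head(h.mean(dim=1)).squeeze(-1)  # (B,) scores

def distill_step(model, seq_probs, teacher_scores, optimizer):
    # One regression step against precomputed AlphaFold pTM/pLDDT values;
    # the teacher is queried offline, so this loop stays fast.
    loss = nn.functional.mse_loss(model(seq_probs), teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Regressing scalar confidence scores, rather than predicting full 3D structures, is what makes the student fast and removes the need for a reference structure when it is later used as a regularizer.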



Figure 1: Overview of the proposed system. Traditional inverse protein folding (designing sequences that fold into a given 3D structure) is augmented by our proposed AFDistill model to increase the diversity of generated sequences while maintaining consistency with the given structure. One way of doing this (red line) would be to use a forward protein folding model, e.g., AlphaFold, to estimate the structure of a generated sequence, compare it with the ground truth to compute metrics such as TM or LDDT, and, finally, regularize the original training loss (usually cross-entropy (CE)). However, inference through folding models is slow (see Fig. 2), making them impractical as part of the optimization loop. Alternatively, bypassing structure estimation, the folding model's internal confidence metrics, such as pTM or pLDDT, can be used instead (blue line). This results in lower-fidelity solutions that are still slow. Instead, in this work, we propose to distill the confidence metrics of AlphaFold into a smaller, faster, and differentiable model (referred to as AFDistill). AFDistill is trained to closely match the accuracy of the AlphaFold-estimated pTM/pLDDT, and can thus be seamlessly used as part of the training loop (green line). The inference of the improved inverse folding model remains unmodified and is shown on the right side of the figure.


