DIFFUSION PROBABILISTIC MODELING OF PROTEIN BACKBONES IN 3D FOR THE MOTIF-SCAFFOLDING PROBLEM

Abstract

Constructing a scaffold structure that supports a desired motif, and thereby confers protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.

1. INTRODUCTION

A central task in protein design is the creation of a stable scaffold to support a target motif. Here, motifs are structural protein fragments that impart biological function, while scaffolds stabilize the motif's structure. Vaccines and enzymes have already been designed by solving certain instances of this motif-scaffolding problem (Procko et al., 2014; Correia et al., 2014; Jiang et al., 2008; Siegel et al., 2010). However, past solutions to this problem have required substantial expert involvement and laborious trial and error.
Machine learning (ML) offers the hope of automating and better directing this search. But existing ML approaches face one of two major roadblocks. First, many methods do not build scaffolds longer than about 20 residues. For many motif sizes of interest, the resulting proteins would be smaller than the shortest commonly studied simple protein folds (35-40 residues) (Gelman & Gruebele, 2014). Second, while other methods may generate longer scaffolds using stochastic search techniques, they require hours of computation to generate a single plausible scaffold (Wang et al., 2022; Anishchenko et al., 2021; Tischer et al., 2020). Moreover, when a plausible scaffold is found, it remains to be experimentally validated. It is therefore desirable to return not just a single scaffold but a set of scaffolds exhibiting diverse sequences and structural variation, which increases the likelihood of success in practice.
In the present work, we demonstrate the promise of a particular generative modeling approach within ML for efficiently returning a diverse set of motif-supporting scaffolds. Generative models have been shown to capture a distribution over diverse protein structures (Lin et al., 2021). But it is not clear how to handle conditioning (on the motif) using these approaches. Diffusion probabilistic models (DPMs) offer a potential alternative; not only do they provide a more straightforward path to handling conditioning, but they have also enjoyed success generating small molecules in 3D (Hoogeboom et al., 2022). Extending DPMs to protein structures, though, is non-trivial: since proteins are larger than small molecules, modeling proteins requires handling the sequential ordering of residues and long-range interactions. Finally, while existing models often generate distance matrices (Anand & Huang, 2018; Lin et al., 2021), we instead focus on generating a full set of 3D coordinates, which should improve designability in practice.
Our resulting model, ProtDiff, is similar to concurrent work on E(3)-equivariant diffusion models for molecules (Hoogeboom et al., 2022), but with modifications specific to protein structure. Moreover, we develop a novel motif-scaffolding procedure based on Sequential Monte Carlo, SMCDiff, that repurposes an unconditionally trained DPM for conditional sampling. We prove that if a DPM matches the data distribution, SMCDiff is guaranteed to provide exact conditional samples in a large-compute limit; this property contrasts with previous methods (Song et al., 2021; Zhou et al., 2021), which we show introduce non-trivial approximation error that impedes performance. Our final motif-scaffolding generative framework, then, has two steps (Fig. 1): first we train ProtDiff to learn a distribution over protein backbones, and then we use SMCDiff with ProtDiff to inpaint arbitrary motifs. Ours is the first machine-learning method to construct scaffolds longer than 20 residues around motifs; we build scaffolds of up to 80 residues on a test case.
Beyond our progress on the motif-scaffolding problem, we provide the following technical contributions: (1) we introduce a protein-backbone generative model in 3D that can generate backbone samples that structurally agree with AlphaFold2 predictions, and (2) we develop a novel conditional sampling algorithm for inpainting.



Figure 1: Overview of the conditional generative modeling approach to the motif-scaffolding problem. We train our new protein backbone diffusion model, ProtDiff, to generate realistic protein backbone structures. Next, we run SMCDiff, our conditional sampling algorithm, with ProtDiff to generate scaffolds (colored in red) conditioned on the motif (colored in blue). For self-consistency evaluation, we use a pretrained fixed-backbone sequence-design model (ProteinMPNN (Dauparas et al., 2022)) to generate the scaffold sequence from a sampled backbone. We then input the sequence to a structure prediction model, in our case AlphaFold2 (AF2) (Jumper et al., 2021), to generate the full protein structure from the generated sequence. We compare the backbone of the predicted structure with the original backbone structure using TM-score (Xu & Zhang, 2010) and root-mean-square distance (RMSD) for the motif.
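
The motif comparison in the final step of Figure 1 can be illustrated with a minimal sketch. It assumes the motif RMSD is computed over matched backbone coordinates (for example, C-alpha atoms) after an optimal rigid superposition, found here with the standard Kabsch algorithm; the text above does not specify the alignment procedure, so this choice, the function name `kabsch_rmsd`, and the use of NumPy are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Hypothetical helper for illustration: P could hold the motif coordinates in
    a sampled backbone and Q the same residues in the predicted structure.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both point sets at the origin.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Cross-covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Force a proper rotation (determinant +1) so mirror images are not matched.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q and measure the remaining per-atom deviation.
    P_aligned = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Qc) ** 2, axis=1))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motif = rng.normal(size=(12, 3))  # made-up coordinates, not real data
    angle = 0.7
    rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                    [np.sin(angle), np.cos(angle), 0.0],
                    [0.0, 0.0, 1.0]])
    moved = motif @ rot.T + np.array([1.0, -2.0, 3.0])
    print(kabsch_rmsd(motif, moved))  # close to zero: same shape, different pose
```

Because the superposition removes rotation and translation, the score depends only on the shape of the motif, which matches the rotation and translation symmetry the E(3)-equivariant model is built around. The reflection check (the sign `d`) matters for proteins, since a mirrored backbone is a different structure.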

