DIFFUSION PROBABILISTIC MODELING OF PROTEIN BACKBONES IN 3D FOR THE MOTIF-SCAFFOLDING PROBLEM

Abstract

Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.

1. INTRODUCTION

A central task in protein design is creation of a stable scaffold to support a target motif. Here, motifs are structural protein fragments imparting biological function, while scaffolds stabilize the motif's structure. Vaccines and enzymes have already been designed by solving certain instances of this motif-scaffolding problem (Procko et al., 2014; Correia et al., 2014; Jiang et al., 2008; Siegel et al., 2010). However, successful solutions to this problem in the past have necessitated substantial expert involvement and laborious trial and error. Machine learning (ML) offers the hope of automating and better directing this search. But existing ML approaches face one of two major roadblocks. First, many methods do not build scaffolds longer than about 20 residues. For many motif sizes of interest, the resulting proteins would be smaller than the shortest commonly studied simple protein folds (35-40 residues) (Gelman & Gruebele, 2014). Second, while other methods may generate longer scaffolds using stochastic search techniques, they require hours of computation to generate a single plausible scaffold (Wang et al., 2022; Anishchenko et al., 2021; Tischer et al., 2020). Moreover, when a plausible scaffold is found, it remains to be experimentally validated. Therefore, it is desirable to return not just a single scaffold but rather a set of scaffolds exhibiting diverse sequences and structural variation, to increase the likelihood of success in practice. In the present work, we demonstrate the promise of a particular generative modeling approach within ML for efficiently returning a diverse set of motif-supporting scaffolds. Generative models have been shown to capture a distribution over diverse protein structures (Lin et al., 2021). But it is not clear how to handle conditioning (on the motif) using these approaches.
Diffusion probabilistic models (DPMs) offer a potential alternative; not only do they provide a more straightforward path to handling conditioning, but they have also enjoyed success generating small molecules in 3D (Hoogeboom et al., 2022). Extending DPMs to protein structures, though, is non-trivial; since proteins are larger than small molecules, modeling proteins requires handling the sequential ordering of residues and long-range interactions. Finally, while existing models often generate distance matrices (Anand & Huang, 2018; Lin et al., 2021), we instead focus on generating a full set of 3D coordinates, which should improve designability in practice. Our resulting model, ProtDiff, is similar to concurrent work on E(3)-equivariant diffusion models for molecules (Hoogeboom et al., 2022), but with modifications specific to protein structure. Moreover, we develop a novel motif-scaffolding procedure based on Sequential Monte Carlo, SMCDiff, that repurposes an unconditionally trained DPM for conditional sampling.

Figure 1: Overview of the conditional generative modeling approach to the motif-scaffolding problem. We train our new protein backbone diffusion model, ProtDiff, to generate realistic protein backbone structures. Next, we run SMCDiff, our conditional sampling algorithm, with ProtDiff to generate scaffolds (colored in red) conditioned on the motif (colored in blue). For self-consistency evaluation, we use a pretrained fixed-backbone sequence-design model (ProteinMPNN (Dauparas et al., 2022)) to generate the scaffold sequence from a sampled backbone. We then input the sequence to a structure prediction model, in our case AlphaFold2 (AF2) (Jumper et al., 2021), to generate the full protein structure from the generated sequence. We compare the backbone of the predicted structure with the original backbone structure using TM-score (Xu & Zhang, 2010) and root-mean-square distance (RMSD) for the motif.
We prove that if a DPM matches the data distribution, SMCDiff is guaranteed to provide exact conditional samples in a large-compute limit; this property contrasts with previous methods (Song et al., 2021; Zhou et al., 2021), which we show introduce non-trivial approximation error that impedes performance. Our final motif-scaffolding generative framework, then, has two steps (Fig. 1): first we train ProtDiff to learn a distribution over protein backbones, and then we use SMCDiff with ProtDiff to inpaint arbitrary motifs. Ours is the first machine-learning method to construct scaffolds longer than 20 residues around motifs; we build scaffolds of up to 80 residues on a test case. Beyond our progress on the motif-scaffolding problem, we provide the following technical contributions: (1) we introduce a protein-backbone generative model in 3D, with the ability to generate backbone samples that structurally agree with AlphaFold2 predictions, and (2) we develop a novel conditional sampling algorithm for inpainting.

1.1. RELATED WORK

Motif-scaffolding. Past approaches have sought to scaffold a motif with native or prespecified protein fragments, but are limited to finding a suitable match in the Protein Data Bank (PDB) and cannot adapt the scaffold to compensate for slight structural mismatches (Cao et al., 2022; Silva et al., 2016; Yang et al., 2021; Sesterhenn et al., 2020; Linsky et al., 2020). More recently, Wang et al. (2022) used pre-trained protein structure prediction networks to recapitulate native scaffolds, but this method failed to generate scaffolds longer than 20 residues and can output only a single candidate scaffold rather than a diverse set. By contrast, our goal is to sample diverse, long scaffolds.

Diffusion models for molecule generation. Several concurrent works have extended equivariant diffusion models to molecule generation. Anand & Achim (2022) extended diffusion models for generation of protein backbone frames and sequences conditioned on secondary structure adjacency matrices. Similarly, Luo et al. (2022) focused on CDR-loop generation using diffusion models conditioned on non-CDR regions of the antibody-antigen. Our method does not require conditioning and is applicable to general proteins. Lee & Kim (2022) approach the same problem as our work but build diffusion models over 2D distance matrices, which require post-processing through Rosetta minimization to produce 3D structures. We demonstrate the capability of diffusion models to directly model 3D coordinates of proteins. Hoogeboom et al. (2022) developed an equivariant diffusion model (EDM) for generating small molecules in 3D. However, because EDM does not enforce a spatial ordering of the atoms that compose small molecules, as we describe in Section 5, it does not learn a coherent chain structure as needed in proteins.

Inpainting and conditional sampling in diffusion models. Point-Voxel Diffusion (PVD) (Zhou et al., 2021) is a 3D diffusion model for generating shapes from the ShapeNet dataset. Though trained to generate shapes unconditionally, PVD completes (or inpaints) full shapes when a partial point cloud is fixed during inference. For general diffusion models, Song et al. (2021) proposed an alternative inpainting approach and remarked that this approach produces approximate conditional samples. However, these methods do not provide theoretical guarantees, and when we compare them to SMCDiff, we find that their approximation error impedes performance when applied to motif-scaffolding. Saharia et al. (2021) developed an inpainting diffusion model by training a diffusion model to denoise randomly generated masked regions while unmasked regions were unperturbed. However, their approach requires a detailed data augmentation strategy that does not exist for proteins. We describe additional related work on protein generative models in Appendix A.

2.1. THE MOTIF-SCAFFOLDING PROBLEM

A protein can be represented by its amino acid sequence and backbone structure. Let $A$ be the set of 20 genetically-encoded amino acids. We denote the sequence of an $N$-residue protein by $s \in A^N$ and its C-α backbone coordinates in 3D by $x = [x_1, \dots, x_N]^T \in \mathbb{R}^{N \times 3}$. We describe a protein as having a fixed structure that is a function of its sequence, so we may write $x(s)$. We divide the $N$ residues into the functional motif $M$ and the scaffold $S$, such that $M \cup S = \{1, 2, \dots, N\}$. The goal is to identify, given the motif structure $x_M$, sequences $s$ whose structure recapitulates the motif to high precision, $x(s)_M \approx x_M$. Appendix B discusses several caveats of this simplified framing (e.g. our assumption of static structures).

2.2. DIFFUSION PROBABILISTIC MODELS

Our approach to the motif-scaffolding problem builds on denoising diffusion probabilistic models (DPMs) (Sohl-Dickstein et al., 2015). We follow the conventions and notation set by Ho et al. (2020), which we review here. DPMs are a class of generative models based on a reversible, discrete-time diffusion process. The forward process starts with a sample $x^{(0)}$ from an unknown data distribution $q$, with density denoted by $q(x^{(0)})$, and iteratively adds noise at each step $t$. By the last step, $T$, the distribution of $x^{(T)}$ is indistinguishable from an isotropic Gaussian: $x^{(T)} \sim \mathcal{N}(x^{(T)}; 0, I)$. Specifically, we choose a variance schedule $\beta^{(1)}, \beta^{(2)}, \dots, \beta^{(T)}$, and define the transition distribution at step $t$ as $q(x^{(t)} \mid x^{(t-1)}) = \mathcal{N}(x^{(t)}; \sqrt{1-\beta^{(t)}}\, x^{(t-1)}, \beta^{(t)} I)$.

DPMs approximate $q$ with a second distribution $p_\theta$ by learning the transition distribution of the reverse process at each $t$, $p_\theta(x^{(t-1)} \mid x^{(t)})$. We follow the conventions set by Ho et al. (2020) in our parameterization and choice of objective. In particular, we take $p_\theta(x^{(t-1)} \mid x^{(t)}) = \mathcal{N}(x^{(t-1)}; \mu_\theta(x^{(t)}, t), \beta^{(t)} I)$ with $\mu_\theta(x^{(t)}, t) = \frac{1}{\sqrt{\alpha^{(t)}}} \left( x^{(t)} - \frac{\beta^{(t)}}{\sqrt{1-\bar{\alpha}^{(t)}}}\, \epsilon_\theta(x^{(t)}, t) \right)$, $\alpha^{(t)} := 1 - \beta^{(t)}$, and $\bar{\alpha}^{(t)} := \prod_{s=1}^{t} \alpha^{(s)}$. We implement $\epsilon_\theta(x^{(t)}, t)$ as a neural network. For training, we marginally sample $x^{(t)} \sim q(x^{(t)} \mid x^{(0)})$ from the forward process as $x^{(t)} = \sqrt{\bar{\alpha}^{(t)}}\, x^{(0)} + \sqrt{1-\bar{\alpha}^{(t)}}\, \epsilon$ and minimize the objective $T^{-1} \sum_{t=1}^{T} \mathbb{E}_{q(x^{(0)}, x^{(t)})} \|\epsilon - \epsilon_\theta(x^{(t)}, t)\|^2$ by stochastic optimization (Ho et al., 2020, Algorithm 1). To generate samples from $p_\theta(x^{(0)})$, we simulate the reverse process. That is, we sample noise for time $T$ as $x^{(T)} \sim \mathcal{N}(0, I)$, and then for each $t = T-1, \dots, 0$, we simulate progressively "de-noised" samples as $x^{(t)} \sim p_\theta(x^{(t)} \mid x^{(t+1)})$.
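The forward noising, training objective, and reverse step reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear variance schedule, the function names, and the stand-in `eps_model` callable (any function mapping `(x_t, t)` to a noise estimate) are all hypothetical choices made for the example.

```python
import numpy as np

def make_schedule(T, beta_min=1e-4, beta_max=0.02):
    """Linear variance schedule (an assumed choice) plus alpha and alpha-bar."""
    betas = np.linspace(beta_min, beta_max, T)   # beta^(1), ..., beta^(T)
    alphas = 1.0 - betas                         # alpha^(t) = 1 - beta^(t)
    alpha_bars = np.cumprod(alphas)              # alpha-bar^(t) = prod_s alpha^(s)
    return betas, alphas, alpha_bars

def noise_sample(x0, t, alpha_bars, rng):
    """Draw x^(t) ~ q(x^(t) | x^(0)) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps

def denoising_loss(eps_model, x0, alpha_bars, rng):
    """One-sample Monte Carlo estimate of the objective with t drawn uniformly."""
    T = len(alpha_bars)
    t = int(rng.integers(1, T + 1))
    xt, eps = noise_sample(x0, t, alpha_bars, rng)
    return float(np.sum((eps - eps_model(xt, t)) ** 2))

def reverse_step(eps_model, xt, t, betas, alphas, alpha_bars, rng):
    """Sample x^(t-1) ~ N(mu_theta(x^(t), t), beta^(t) I)."""
    eps_hat = eps_model(xt, t)
    coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t - 1])
    return mean + np.sqrt(betas[t - 1]) * rng.standard_normal(xt.shape)
```

In practice `eps_model` would be a trained network and the loss would be averaged over minibatches; here a function returning zeros is enough to exercise the shapes.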

3. PROTDIFF: A DIFFUSION MODEL OF PROTEIN BACKBONES IN 3D

Implementation of diffusion probabilistic models requires choosing an architecture for the neural network $\epsilon_\theta(x^{(t)}, t)$ introduced abstractly in Section 2.2. In this section we describe ProtDiff, which corresponds to the choice of $\epsilon_\theta(x^{(t)}, t)$ as a translation and rotation equivariant graph neural network tailored to modeling protein backbones. We leave architectural and input encoding details to Appendix C.

The challenge of modeling points in 3D. The properties and functions of proteins are dictated by the relative geometry of their residues, and are invariant to the coordinate system chosen to encode them. Recent work on neural network modeling of 3D data has found, both theoretically and empirically, that neural networks constrained to satisfy geometric invariances can provide inductive biases that improve generalization and training efficiency (Batzner et al., 2022). Motivated by this observation, we parameterize $\epsilon_\theta$ by an equivariant graph neural network (EGNN) (Satorras et al., 2021), which in 3D is equivariant to transformations in the Euclidean group. Xu et al. (2022) proved that if $\epsilon_\theta$ is equivariant to a group then $p_\theta$ is invariant to the same group.

Tailoring EGNN to protein backbones. We now describe our EGNN implementation, which we tailor to protein backbones and DPMs through the choice of edge and node features. To model every pairwise residue interaction, we represent backbones by a fully connected graph. Each node in the graph is indexed by $n = 1, \dots, N$, and corresponds to a residue. We associate each node with coordinates $x_n \in \mathbb{R}^3$ and $D$ features $h_n \in \mathbb{R}^D$. For each pair of nodes $n, n'$ we define an edge and associate it with edge features. We construct our EGNN by stacking $L$ equivariant graph convolutional layers (EGCL). Each layer takes node coordinates and features as input, and outputs updated coordinates and features, with the first layer taking initial values $(x, h)$. We write the output of EGNN after $L$ layers as $\hat{x} = \mathrm{EGNN}[x, h]$.
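To make the equivariance property concrete, the following is a minimal NumPy sketch of a single equivariant graph convolutional layer on a fully connected graph, in the general form of Satorras et al. (2021). It is an assumption-laden illustration rather than ProtDiff's layer (whose details are in Appendix C): single random linear maps with `tanh` stand in for the learned MLPs, and `init_params`, `egcl`, and the weight names are hypothetical.

```python
import numpy as np

def init_params(node_dim, edge_dim, msg_dim, rng):
    """Random weights standing in for learned MLPs (illustrative only)."""
    return {
        "W_e": 0.1 * rng.standard_normal((2 * node_dim + 1 + edge_dim, msg_dim)),
        "W_x": 0.1 * rng.standard_normal((msg_dim, 1)),
        "W_h": 0.1 * rng.standard_normal((node_dim + msg_dim, node_dim)),
    }

def egcl(x, h, edge_feats, params):
    """One EGCL: x is (N, 3), h is (N, D), edge_feats is (N, N, E)."""
    n, d = h.shape
    diff = x[:, None, :] - x[None, :, :]               # x_i - x_j, shape (N, N, 3)
    dist2 = np.sum(diff ** 2, axis=-1, keepdims=True)  # rotation/translation invariant
    hi = np.broadcast_to(h[:, None, :], (n, n, d))
    hj = np.broadcast_to(h[None, :, :], (n, n, d))
    edge_in = np.concatenate([hi, hj, dist2, edge_feats], axis=-1)
    m = np.tanh(edge_in @ params["W_e"])               # messages m_ij, shape (N, N, M)
    mask = 1.0 - np.eye(n)[..., None]                  # drop self-edges
    scale = np.tanh(m @ params["W_x"])                 # scalar gate per edge
    # Coordinates move along relative vectors, so the update is equivariant.
    x_new = x + np.sum(mask * diff * scale, axis=1) / (n - 1)
    m_agg = np.sum(mask * m, axis=1)                   # aggregate messages per node
    h_new = h + np.tanh(np.concatenate([h, m_agg], axis=-1) @ params["W_h"])
    return x_new, h_new
```

Because the layer sees coordinates only through relative vectors and squared distances, rotating and translating the input rotates and translates the output coordinates identically and leaves the node features unchanged.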
In the context of diffusion models, we predict the noise at time $t$ with the following parameterization:

$\epsilon_\theta(x^{(t)}, t) = \hat{x} - x^{(t)}, \quad \hat{x} = \mathrm{EGNN}[x^{(t)}, h(t)]. \quad (1)$

We now describe our choice of node and edge features. Our choice is motivated by the linear chain structure of protein backbones; residues close in sequence are necessarily close in 3D space. To allow this chain constraint to be learned more easily, we fix an ordering of nodes in the graph to correspond to sequence order. We include as edge features positional offsets as done in Ingraham et al. (2019), which we represent using sinusoidal positional encoding features (Vaswani et al., 2017). For node features, we similarly use a sinusoidal encoding of sequence position as well as of the diffusion time step $t$ following Kingma et al. (2021). We additionally process the time encoding to be orthogonal to the positional encoding.
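The node and edge features described above can be sketched as follows. This is a hypothetical illustration: the feature dimension, the `max_period` constant, and the function names are assumptions, and the time encoding is simply concatenated to the positional encoding rather than processed to be orthogonal to it as in the paper.

```python
import numpy as np

def sinusoidal_encoding(positions, dim, max_period=10000.0):
    """Sinusoidal features (Vaswani et al., 2017 style); dim must be even."""
    positions = np.asarray(positions, dtype=float)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = positions[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def backbone_features(num_residues, t, dim=16):
    """Node features from sequence index and time step; edge features from offsets."""
    idx = np.arange(num_residues)
    pos_enc = sinusoidal_encoding(idx, dim)                        # (N, dim)
    time_enc = np.broadcast_to(sinusoidal_encoding(t, dim),
                               (num_residues, dim))                # same t for every node
    node_feats = np.concatenate([pos_enc, time_enc], axis=-1)      # (N, 2 * dim)
    offsets = idx[:, None] - idx[None, :]                          # signed sequence offsets
    edge_feats = sinusoidal_encoding(offsets, dim)                 # (N, N, dim)
    return node_feats, edge_feats
```

Encoding the signed offset `i - j` on each edge is what lets a fully connected graph network distinguish sequence neighbors from distant residues.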

FILTERING

The second stage of our generative modeling approach to the motif-scaffolding problem is to sample scaffolds $x_S^{(0)}$ from $p_\theta(x_S^{(0)} \mid x_M^{(0)})$. Section 4.1 discusses the intractability of sampling from $p_\theta(x_S^{(0)} \mid x_M^{(0)})$ exactly and the limitations of a simple approximation introduced by Song et al. (2021). In Section 4.2, we then frame computation of $p_\theta(x_S^{(0)} \mid x_M^{(0)})$ as a sequential Monte Carlo (SMC) problem (Doucet et al., 2001) and approximate it with a particle filtering algorithm (Algorithm 1).

METHOD

The conditional distributions of a DPM are defined implicitly through the steps of the reverse process. We may write the conditional density explicitly as $p_\theta(x_S^{(0)} \mid x_M^{(0)}) \propto p_\theta(x_S^{(0)}, x_M^{(0)}) = p_\theta(x^{(0)}) = \int p_\theta(x^{(T)}) \prod_{t=0}^{T-1} p_\theta(x^{(t)} \mid x^{(t+1)})\, dx^{(1:T)}$. However, the high-dimensional integral on the right-hand side above is intractable (both analytically and numerically) to compute.

Algorithm 1 SMCDiff: Particle filtering for conditionally sampling from unconditional diffusion models
Input: $x_M^{(0)}$ (motif), $K$ (# particles)
// Forward diffuse motif
$\breve{x}_M^{(1:T)} \sim q(x_M^{(1:T)} \mid x_M^{(0)})$
// Reverse diffuse particles
$\forall k,\ x_k^{(T)} \overset{\text{i.i.d.}}{\sim} p_\theta(x^{(T)})$
for $t = T, \dots, 1$ do
    // Replace motif
    $\forall k,\ x_k^{(t)} \leftarrow [\breve{x}_M^{(t)}, x_{S,k}^{(t)}]$
    $\forall k,\ w_k^{(t)} \leftarrow p_\theta(\breve{x}_M^{(t-1)} \mid x_k^{(t)})$
    $\forall k,\ \tilde{w}_k^{(t)} \leftarrow w_k^{(t)} / \sum_{k'=1}^{K} w_{k'}^{(t)}$
    $\tilde{x}_{1:K}^{(t)} \sim \mathrm{Resample}(\tilde{w}_{1:K}^{(t)}, x_{1:K}^{(t)})$
    $\forall k,\ x_k^{(t-1)} \overset{\text{indep.}}{\sim} p_\theta(x^{(t-1)} \mid \tilde{x}_k^{(t)})$
end for
Return $x_{S,1:K}^{(0)}$

To overcome this intractability, we build on the work of Song et al. (2021), who introduced a practical algorithm that generates approximate conditional samples. This strategy is to (1) forward diffuse the conditioning variable to obtain $x_M^{(1:T)} \sim q(x_M^{(1:T)} \mid x_M^{(0)})$, and then (2) for each $t$, sample $x_S^{(t)} \sim p_\theta(x_S^{(t)} \mid x_M^{(t+1)}, x_S^{(t+1)})$. We call this approach the replacement method (following Ho et al. (2022)) and make it explicit in Appendix Algorithm 2. However, in Proposition D.1 we show that the replacement method introduces irreducible approximation error that cannot be eliminated by making $p_\theta$ more expressive. Additionally, although this approximation error is not analytically tractable in general, we show in Corollary D.2 the dependence of this error on the covariance of $x_M^{(0)}$ and $x_S^{(0)}$ in the case that $q(x_M^{(0)}, x_S^{(0)})$ is bivariate Gaussian.
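The two steps of the replacement method can be sketched as below. This is an illustrative reading of the description above, not Appendix Algorithm 2 itself: the function name, the array layout, and the stand-in `eps_model` callable are assumptions, and writing the clean motif back into the returned backbone at the end is a convenience added for this example.

```python
import numpy as np

def replacement_sample(eps_model, x0_motif, motif_idx, n_res, betas, rng):
    """Approximate conditional sample via the replacement method (a sketch)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)

    # (1) Forward diffuse the motif: x_M^(1:T) ~ q(. | x_M^(0)), one step at a time.
    motif_traj = [np.asarray(x0_motif, dtype=float)]
    for t in range(1, T + 1):
        prev = motif_traj[-1]
        noise = rng.standard_normal(prev.shape)
        motif_traj.append(np.sqrt(alphas[t - 1]) * prev + np.sqrt(betas[t - 1]) * noise)

    # (2) Reverse diffuse, overwriting the motif rows with the noised motif at each t.
    x = rng.standard_normal((n_res, 3))
    for t in range(T, 0, -1):
        x[motif_idx] = motif_traj[t]
        eps_hat = eps_model(x, t)
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])
        x = mean + np.sqrt(betas[t - 1]) * rng.standard_normal(x.shape)

    x[motif_idx] = motif_traj[0]  # report the clean motif alongside the scaffold
    return x
```

Note that each reverse step conditions only on the motif noised to the current level, which is the source of the approximation error discussed in Proposition D.1.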
4.2. CONDITIONAL SAMPLING IS A SEQUENTIAL MONTE CARLO PROBLEM

We next frame approximation of $q(x_S^{(0)} \mid x_M^{(0)})$ as a sequential Monte Carlo problem that we may solve by particle filtering. Intuitively, particle filtering addresses a limitation of the replacement method: the failure at each time $t$ to look beyond the current step to the less-noised motif $x_M^{(t-1)}$ when sampling $x_S^{(t)} \sim p_\theta(x_S^{(t)} \mid x^{(t+1)})$. Our key insight is that because $p_\theta(x_M^{(t-1)} \mid x^{(t)})$ provides a mechanism to assess the likelihood of $x_M^{(t-1)}$, we can prioritize noised scaffolds that are more consistent with the motif. Particle filtering leverages this mechanism to provide a sequence of discrete approximations to each $p_\theta(x_S^{(t)} \mid x_M^{(t-1:T)})$ that look ahead by this extra step. Finally, at $t = 0$ we have an approximation to $p_\theta(x_S^{(0)} \mid x_M^{(0:T)})$. Then, using Proposition 4.1 below, we can obtain an approximate sample from $q(x_S^{(0)} \mid x_M^{(0)})$. This framing permits the application of standard particle filtering algorithms (Doucet et al., 2001). Algorithm 1 summarizes an implementation of this procedure that uses residual resampling (Doucet & Johansen, 2009) to mitigate the collapse of the sequential approximations into point masses. SMCDiff provides a tunable trade-off between computational cost and statistical accuracy through the choice of the number of particles $K$. In our next proposition we make this trade-off explicit.

Proposition 4.1. Suppose that $p_\theta$ exactly matches the forward diffusion process such that for every $x^{(t+1)}$, $p_\theta(x^{(t)} \mid x^{(t+1)}) = q(x^{(t)} \mid x^{(t+1)})$, and consider any motif $x_M^{(0)}$. Let $x_{S,K}$ be a particle chosen at random from the output of Algorithm 1 with $K$ particles. Then $x_{S,K}$ converges in distribution to $q(x_S^{(0)} \mid x_M^{(0)})$ as $K$ goes to infinity.
The significance of Proposition 4.1 is that it guarantees Algorithm 1 can provide arbitrarily accurate conditional samples, provided an accurate diffusion model and a large enough compute budget (determined by the number of particles). To our knowledge, SMCDiff is the first algorithm for asymptotically exact conditional sampling from unconditional DPMs. Our proof of the proposition, which we leave to Appendix D, is obtained from an application of standard asymptotics for particle filtering (Chopin & Papaspiliopoulos, 2020, Proposition 11.4).
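A compact NumPy sketch of the particle filter in Algorithm 1, including residual resampling, is given below. It is a hedged illustration under several assumptions: `eps_model` is a stand-in for the trained network, the reverse transition is taken to be the Gaussian of Section 2.2 so that the weights reduce to a Gaussian likelihood of the less-noised motif, weights are computed in log space for numerical stability, and all function names are hypothetical.

```python
import numpy as np

def residual_resample(weights, rng):
    """Residual resampling: keep floor(K * w_k) copies, draw the rest at random."""
    k = len(weights)
    scaled = k * np.asarray(weights, dtype=float)
    counts = np.floor(scaled).astype(int)
    n_rest = k - counts.sum()
    if n_rest > 0:
        resid = scaled - counts
        counts = counts + rng.multinomial(n_rest, resid / resid.sum())
    return np.repeat(np.arange(k), counts)          # indices of surviving particles

def smcdiff(eps_model, x0_motif, motif_idx, n_res, betas, n_particles, rng):
    """Sketch of Algorithm 1; returns K scaffolds with shape (K, |S|, 3)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    # Forward diffuse the motif once; all particles share this trajectory.
    motif_traj = [np.asarray(x0_motif, dtype=float)]
    for t in range(1, T + 1):
        prev = motif_traj[-1]
        noise = rng.standard_normal(prev.shape)
        motif_traj.append(np.sqrt(alphas[t - 1]) * prev + np.sqrt(betas[t - 1]) * noise)

    x = rng.standard_normal((n_particles, n_res, 3))
    for t in range(T, 0, -1):
        x[:, motif_idx] = motif_traj[t]                       # replace motif
        eps_hat = np.stack([eps_model(xk, t) for xk in x])
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])  # mu_theta per particle
        # Weight: Gaussian likelihood of the less-noised motif under p_theta.
        diff = motif_traj[t - 1] - mean[:, motif_idx]
        log_w = -0.5 * np.sum(diff ** 2, axis=(1, 2)) / betas[t - 1]
        w = np.exp(log_w - log_w.max())
        w = w / w.sum()
        idx = residual_resample(w, rng)                       # resample particles
        x = mean[idx] + np.sqrt(betas[t - 1]) * rng.standard_normal(x.shape)  # propose
    scaffold_idx = np.setdiff1d(np.arange(n_res), motif_idx)
    return x[:, scaffold_idx]
```

The number of particles plays the role of $K$ in Proposition 4.1: more particles cost proportionally more network evaluations per step but give a better discrete approximation at each time.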

5. EXPERIMENTS

We empirically demonstrate the ability of our method to scaffold motifs and sample protein backbone structures. We describe our procedure for evaluating backbone designs in Section 5.1. We demonstrate the promise of our method for the motif-scaffolding problem in Section 5.2. And we investigate our method's strengths and weaknesses via experiments in unconditional sampling in Section 5.3. We train a single instance of ProtDiff and use it across all of our experiments. For simplicity, we limited our training data to single chain proteins taken from PDB that are no longer than 128 residues. See Appendix F for training details. Baselines. As mentioned in Section 1.1, Wang et al. (2022) is the only prior machine learning work to address the motif-scaffolding problem. We do not compare against this as a baseline because no stable implementation was available at the time of writing. The most closely related method for unconditional sampling with available software is trDesign (Anishchenko et al., 2021) , but this method does not allow specification of a motif. The ML method most similar to ProtDiff is the concurrently developed equivariant diffusion model (EDM) proposed by Hoogeboom et al. (2022) . Like ProtDiff, EDM uses a densely connected EGNN architecture but without sequence-distance edge features. Consequently, it does not impose any sequence order, and therefore does not yield a way to relate generated coordinates to a backbone chain.

5.1. In silico EVALUATION OF DESIGNED BACKBONES

While experimental validation via X-ray crystallography remains the gold standard for evaluating computationally designed proteins, recent work (Wang et al., 2022; Lin et al., 2021) has proposed to leverage highly accurate protein structure prediction neural networks as an in silico proxy for true structure. More specifically, Wang et al. (2022) jointly design protein sequence and structure, and validate by comparing the designed structures with those predicted by AlphaFold2 (AF2) (Jumper et al., 2021). Here, our goal is to assess the quality of scaffolds generated independent of a specific sequence, so we treat fixed-backbone sequence design as a downstream step as in Lin et al. (2021). Our evaluation with AF2 is as follows. For each generated scaffold we use a C-α only version of ProteinMPNN (Dauparas et al., 2022) with a temperature of 0.1 to sample 8 amino acid sequences likely to fold to the same backbone structure. We then run AF2 with the released CASP14 weights and 15 recycling iterations. We do not include a multiple sequence alignment as an input to AF2. Our choice of ProteinMPNN and AF2 (without MSAs) is motivated by their empirical success in various de novo protein design tasks and their ability to recapitulate native proteins (Dauparas et al., 2022; Bennett et al., 2022). To assess unconditionally sampled scaffolds, we then evaluate the agreement of our backbone sample with the AF2-predicted structures using the maximum TM-score (Zhang & Skolnick, 2005) across all generated sequences, which we refer to as scTM, for self-consistency TM-score. To assess whether generated prospective scaffolds support a motif, we compute the root mean squared distance (RMSD) between the desired and predicted motif coordinates after alignment and refer to this metric as the motif RMSD. Appendix Algorithm 4 outlines the exact steps.
Because a TM-score > 0.5 indicates that two structures have the same fold (Zhang & Skolnick, 2005), we say that a backbone is designable if scTM > 0.5. The ability of AF2 to reproduce the same backbone from an independently designed sequence is evidence that a sequence can be found for the starting structure. To verify that this is a reasonable cutoff, we analyzed scTM over our training set and found 87% to be designable.
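The two metrics can be illustrated with a short NumPy sketch. It is not Appendix Algorithm 4: the RMSD below uses a standard Kabsch superposition as an assumed alignment procedure, TM-scores are taken as precomputed inputs (for example from an external TM-score tool), and the function names are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (n, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)                  # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                            # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid improper rotations (reflections)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation taking Pc onto Qc
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))

def self_consistency(tm_scores, threshold=0.5):
    """scTM is the maximum TM-score over designed sequences; designable if above threshold."""
    sc_tm = float(max(tm_scores))
    return sc_tm, sc_tm > threshold
```

For a motif, `kabsch_rmsd` would be applied to the desired motif coordinates and the corresponding residues of the AF2 prediction. Because reflections are excluded, a mirror-image motif yields a nonzero RMSD.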

5.2. MOTIF-SCAFFOLDING VIA CONDITIONAL SAMPLING

We evaluated our motif-scaffolding approach (combining SMCDiff and ProtDiff) on motifs extracted from existing proteins in the PDB and found that our approach can generate long and diverse scaffolds that support these motifs. We chose to first evaluate on motifs extracted from proteins present in the training set because we knew that at least one stabilizing scaffold exists. We considered 2 examples taken from the PDB with IDs 6exz and 5trv, which are 69 and 118 residues long, respectively. We chose these examples due to their high secondary structure composition while being representative of the shortest and longest lengths seen during training. For each structure, we chose a 15-25 residue helical segment as the motif (see Appendix H for details). The remainder of each protein is one possible supporting scaffold. We sought to assess if we could recover this and other scaffolds with the same size and motif placement. Based on prior work (Wang et al., 2022), we expected that building larger scaffolds around a motif would be more challenging than building smaller scaffolds. To assess this length dependence, we expanded the segment used as the motif when running SMCDiff by including additional residues on each side. In each case, though, we compute the motif RMSD over the minimal motif. In Figure 2B, we present motif-scaffolding performance and its dependence on scaffold size for 5trv, the longer of the two test proteins. For the 5trv test case, the lower quartile of the motif RMSD for SMCDiff is below 1Å for scaffolds up to 80 residues. Since 1Å is atomic-level resolution, we conclude that our approach can succeed in this length range. Figure 2A provides a visualization of our method's capacity to generate long and diverse scaffolds. The figure depicts two dissimilar scaffolds of lengths 34 and 54 produced by SMCDiff with 64 particles. Both scaffolds are designable and agree with AF2 (scTM > 0.5).
Diversity is particularly evident in the different orderings of secondary structures. Figure 2B compares SMCDiff to two naive inpainting methods, fixed and replacement. In fixed, the motif is fixed for every timestep t, and the reverse diffusion is applied only to the scaffold (as done by Zhou et al. (2021)); replacement is the method described in Section 4. In contrast to SMCDiff, these baselines fail to generate successful scaffolds longer than 50 residues on 5trv, as determined by the location of their lower quartiles. We next applied these three inpainting methods to harder targets in order to measure generalization to out-of-distribution and more difficult motifs comprising discontiguous regions and loops. We consider a motif obtained from the respiratory syncytial virus (RSV) protein and a calcium-binding EF-hand motif, both of which are not in the training dataset. RSV is known to be difficult due to its composition of helical, loop, and sheet segments, while EF-hand is a discontiguous loop motif found in a calcium-binding protein. More details about both motifs can be found in Wang et al. (2022); there the authors report the only known successful scaffold of these motifs, but they attain it with a computationally intensive hallucination approach. We found that our method failed to generate scaffolds predicted to recapitulate the motif (Appendix H); however, SMCDiff provided smaller median motif RMSDs than the other two inpainting methods.

Compute cost. The computation of SMCDiff with 64 particles takes approximately 2 minutes per independent sample, while the alternative methods fixed and replacement can produce 64 independent samples in the same time. By contrast, the hallucination approach of Wang et al. (2022) involves running a Markov chain for thousands of steps, and has runtime on the order of hours for a single sample (Anishchenko et al., 2021).

5.3. UNCONDITIONAL SAMPLING

We next investigate the origins of the diversity seen in Figure 2 by analyzing the diversity and designability of ProtDiff samples without conditioning on a motif. We first check that ProtDiff produces designable backbones. To do this, we generated 10 backbone samples for each length between 50 and 128 and then calculated scTM for each sample. In Fig. 3A, we find that 11.8% of samples have scTM > 0.5. However, the majority of backbones do not pass this threshold. We also observe that designability depends strongly on length, which we expect because longer proteins are harder to model in 3D and to design sequences for. We separated the lengths below 128 residues into two categories and refer to them as short (50-70) and long (70-128). Our results in Figure 3A indicate 17% of designs in the short category are designable vs. 9% in the long category. In Figure 13, we present a structural clustering of these designable backbones; we find that these backbones exhibit diverse topologies. We next sought to evaluate the ability of ProtDiff to generalize beyond the training set and produce novel backbones. In Figure 3B each point represents a backbone sample from ProtDiff. The horizontal coordinate of a point is the scTM, and the vertical coordinate is the minimum TM-score across the training set. We found a strong positive correlation between scTM and this minimum TM-score, indicating that many of the most designable backbones generated by ProtDiff were a result of training set memorization. However, if the model were only memorizing the training set, we would see TM-scores consistently near 1.0; the range of scores in Figure 3B indicates that this is not the case and that the model is introducing a degree of variability. Figure 3C gives an example of a backbone with scTM > 0.5 that appears to be novel. Its closest match in the PDB has TM-score = 0.54. Fig. 3B illustrates a limitation of our method: many of our sampled backbones are not designable.
One contributing factor is that ProtDiff does not handle chirality. Hence ProtDiff generates backbones with the wrong handedness, which cannot be realized by any sequence. Fig. 3B shows that 45% of all backbone samples had at least one incorrect, left-handed helix. Of these, most have scTM < 0.5. We describe calculating left-handed helices in Appendix G. Fig. 4 illustrates an interpolation between two samples, showing how ProtDiff's outputs change as a function of the noise used to generate them. To generate these interpolations, we pick two backbone samples that result in different folds. For independent samples generated with noise $\epsilon^{(0:T)}$ and $\tilde{\epsilon}^{(0:T)}$ we interpolate with noise set to $\sqrt{\alpha}\, \epsilon^{(0:T)} + \sqrt{1-\alpha}\, \tilde{\epsilon}^{(0:T)}$ for $\alpha$ between 0 and 1. The depicted values of $\alpha$ are chosen to highlight transition points, with full interpolations included in Appendix H.3. A future direction is to exploit the latent structure of ProtDiff to control backbone topology.
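The square-root weighting in the noise interpolation is what keeps the interpolated noise at unit variance when the two noise sources are independent standard Gaussians, since $\alpha + (1 - \alpha) = 1$. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def interpolate_noise(eps_a, eps_b, alpha):
    """Variance-preserving interpolation between two independent noise arrays."""
    return np.sqrt(alpha) * eps_a + np.sqrt(1.0 - alpha) * eps_b
```

Setting `alpha` to 1 or 0 recovers the first or second noise array, respectively; a plain linear blend `alpha * eps_a + (1 - alpha) * eps_b` would instead shrink the variance at intermediate values.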

6. DISCUSSION

The motif-scaffolding problem has applications ranging from medicine to material science (King et al., 2012), but remains unsolved for many functional motifs. We have created the first generative modeling approach to motif-scaffolding by developing ProtDiff, a diffusion probabilistic model of protein backbones, and SMCDiff, a procedure for generating scaffolds conditioned on a motif. Although our experiments were limited to a small set of proteins, our results demonstrate that our procedure is the first capable of generating diverse scaffolds longer than 20 residues with computation time reliably on the order of minutes or less. Our work demonstrates the potential of machine learning methods to be applied in realistic protein design settings.

General conditional sampling. SMCDiff is applicable to generic DPMs and is not limited to proteins and motif-scaffolding. While we do not claim that SMCDiff outperforms state-of-the-art conditional diffusion models on other tasks such as image generation, we demonstrate a clear advantage of SMCDiff over the replacement method on a toy task of inpainting MNIST images in Appendix I. Extending SMCDiff beyond motif-scaffolding is outside the scope of the present work, but the advantages of a single model for both unconditional and conditional generation warrant additional research.

Modeling limitations. Our present results do not indicate our procedure can generalize to motifs that are not present in the training set. We believe improvements in protein modeling could provide better inductive biases for generalization. ProtDiff, based on EGNN, is reflection equivariant since it only sees pairwise distances between 3D C-α coordinates. Additionally, ProtDiff does not explicitly model primary sequence or side-chains. Hoogeboom et al. (2022) demonstrate the benefits of modeling sequence information in small molecules; jointly modeling sequence and structure in a single model could improve the designability of protein scaffolds and backbones as well.

Data limitations. As noted, our training set is small because we filtered by length and oligomeric state (using only monomeric proteins). Scaling up to longer proteins opens up thousands more examples from the PDB, but in preliminary experiments has proven challenging. Lastly, further development and comparison of methods for motif scaffolding will benefit from standard evaluation benchmarks. Developing a benchmark proved to be difficult since motifs are not labeled in protein databases. It will be important to gather motifs of biological importance in order to guide ML method development towards real-world applications. Because no such benchmarks exist, developing them is a valuable direction for future work.

A ADDITIONAL RELATED WORK

We next cover additional related work on generative models of protein sequence and structure, beyond the discussion in Section 1.1. Following the success of deep language models, Ferruz et al. (2022) developed protein sequence models to generate new proteins, but these models do not allow specification of structural motifs. Another class of methods, referred to as fixed-backbone sequence design (Fleishman et al., 2011; Ingraham et al., 2019; Xiong et al., 2020; McPartlon et al., 2022; Hsu et al., 2022), attempts to solve the problem of identifying a sequence that folds into any given designable backbone structure. In the present work, we utilize a particular sequence design method, ProteinMPNN (Dauparas et al., 2022), but in principle any other fixed-backbone sequence design method could be used in its place. Anand & Huang (2018), Lin et al. (2021), and Wu et al. (2021) propose generative adversarial networks, variational autoencoders, and energy-based models, respectively, on distance matrices, but these approaches (1) do not generate backbones compatible with a specified motif and (2) rely on an unwieldy optimization step to translate the distance matrix into backbone coordinates. Other authors use neural networks (Tischer et al., 2020; Anishchenko et al., 2021; Wang et al., 2022; Huang et al., 2022; Wu et al., 2021), but their methods require a computationally challenging conformational landscape exploration.

B PROBLEM ASSUMPTIONS AND MODELING HEURISTICS

The formulation of the motif-scaffolding problem presented in Section 2.1 makes several simplifying assumptions, and our modeling approach relies on several heuristics. We describe these assumptions and heuristics in what follows, and comment on how they might be addressed by further methodological developments. But we first describe an illustrative example of an instance of motif scaffolding.

Protein sequence-structure relationship. Generally speaking, a protein's sequence encodes an ensemble of conformations, populated to different degrees at biological temperatures. Anfinsen's hypothesis states that the ground state conformation is thermodynamically accessible (Anfinsen, 1973), providing a mapping from sequence to a unique (ground state) structure. In practice, the ground state structures make up the vast majority of experimentally determined protein conformations, as over 95% of structures in the Protein Data Bank (PDB) are collected at cryogenic temperatures (Fraser et al., 2011). Thus we simplify our problem by saying that a sequence uniquely maps to a static structure (i.e. the ground state structure). However, violations of this assumption arise in some PDB structures as a result of (1) context-specific determinants of structure such as post-translational modifications and environmental factors including pH, binding partners, and salts, as well as (2) thermodynamic inaccessibility of the ground state.

Motif sequence and side-chains. As stated in Section 2.1, we assume we may represent a functional motif by the coordinates of its C-α atoms. However, the biochemical functions of proteins depend not only on backbone structure, but also on side-chains. For example, the activity of many enzymes is imparted by triplets of residues, known as catalytic triads, whose ability to catalyze reactions depends on the spatial organization of side-chain atoms. Our problem statement and subsequent evaluation scheme are agnostic to the amino acid identity of motif residues, let alone side-chain positioning. A more complete representation of a motif would include the side-chain identities (i.e. the amino acid sequence) and side-chain atom coordinates.

Scaffold length and motif placement. We have additionally assumed that the size of scaffolds and the indices of motif residues within the backbone chain, M, are known a priori. However, in practice satisfactory scaffolds could have different lengths and different motif placements, and typically it is not known a priori what lengths and placements will be best. Previous works have addressed this challenge through brute force by sampling multiple lengths and placements, and relied on post-hoc filtering to identify the most promising scaffolds (Wang et al., 2022; Yang et al., 2021). Subsequent work on ML methods could potentially generalize beyond this assumption to efficiently sample appropriate scaffold lengths and motif placements.

Sequence and side-chain modeling. ProtDiff models only the backbone coordinates and leaves sequence design to a subsequent stage, for which we have used ProteinMPNN. A more complete representation of a protein could include both sequence and structure (where structure can be divided into the backbone and side-chain atom coordinates). To model sequence, we rely on a separately trained neural network, ProteinMPNN, but this is not ideal. Unless ProtDiff produces perfect backbones, one would expect the backbone samples of ProtDiff to present a substantial domain shift when used as input for ProteinMPNN.

3D backbone representation. In this work, we represent a protein structure using the C-α coordinates of every residue along the backbone. However, this representation is coarse-grained and ignores additional backbone atomic coordinates, namely the backbone carbon and nitrogen atoms. Dauparas et al. (2022) observed that additionally modeling the backbone nitrogen and carbon atoms along with the C-β of every residue (to capture side-chain information) improved performance (as measured by sequence recovery) for fixed-backbone sequence design. We hypothesize that modeling additional coordinates of every residue would also improve the designability performance of ProtDiff. Constraining ProtDiff to place the remaining atoms in the correct orientation could help enforce correct chirality and mitigate chain breaks.

C ADDITIONAL PROTDIFF DETAILS

As a reminder from Section 3, each node in the graph is indexed by $n = 1, \dots, N$ and corresponds to a residue with coordinates $x_n \in \mathbb{R}^3$ and node features $h_n \in \mathbb{R}^D$. For each pair of nodes $n, n'$ we define an edge and associate it with edge features $a_{nn'} \in \mathbb{R}^D$. Our neural network to predict $\epsilon_\theta$ is an instance of EGNN composed of multiple EGCL layers. We recount details of EGCL and then discuss construction of edge and node features, $a_{nn'}$ and $h_n$.

Equivariant graph convolution layers (EGCL). Each layer $l = 1, \dots, L$ defines an update as $(x^l, h^l) = \mathrm{EGCL}[x^{l-1}, h^{l-1}]$ where for each node $n$

$$x_n^l = x_n^{l-1} + \sum_{n' \neq n} \vec{\omega}_{nn'} \cdot \phi_x(h_n^{l-1}, h_{n'}^{l-1}, d_{nn'}, a_{nn'}) \quad \text{and} \quad h_n^l = \phi_h(h_n^{l-1}, m_n),$$

for

$$\vec{\omega}_{nn'} = \frac{x_n^{l-1} - x_{n'}^{l-1}}{\sqrt{d_{nn'}} + \gamma}, \qquad m_n = \sum_{n' \neq n} \phi_e(h_n^{l-1}, h_{n'}^{l-1}, d_{nn'}, a_{nn'}), \qquad d_{nn'} = \|x_n^{l-1} - x_{n'}^{l-1}\|_2^2.$$

$\phi_e$, $\phi_h$, and $\phi_x$ are fully connected neural networks, and $\gamma$ is a small positive constant included for numerical stability. The first EGCL layer takes in initial node embeddings, $h^0$, while edge embeddings, $a_{nn'}$, are kept fixed throughout.

Initial node and edge embeddings. Each edge between two residues indexed in the sequence by $(n, n')$ is featurized with $D$ features obtained through a sinusoidal encoding of relative offset:

$$a_{nn'} = \begin{bmatrix} \varphi(n - n', 1) \\ \vdots \\ \varphi(n - n', D) \end{bmatrix}, \quad \text{where} \quad \varphi(x, k) = \begin{cases} \sin\left(x \cdot \pi / N^{2 \cdot k / D}\right), & k \bmod 2 = 0 \\ \cos\left(x \cdot \pi / N^{2 \cdot (k-1) / D}\right), & k \bmod 2 = 1. \end{cases}$$

For node features, we similarly use a sinusoidal encoding of sequence position as well as of the diffusion time step $t$ as

$$h_n(t) = \begin{bmatrix} \varphi(n, 1) \\ \vdots \\ \varphi(n, D) \end{bmatrix} + R \begin{bmatrix} \varphi(t, 1) \\ \vdots \\ \varphi(t, D) \end{bmatrix},$$

where $R$ is a $D \times D$ orthogonal matrix chosen uniformly at random. Intuitively, applying $R$ transforms the time encoding to be orthogonal to the positional encoding.

Coordinate scaling. While protein structures are typically parameterized in Angstroms, we transform the input protein coordinates to nanometers by dividing by 10. This scaling brings the backbones to a spatial scale similar to the reference distribution at which the forward noising process is stationary, a unit variance isotropic Gaussian. Importantly, the distribution of the final step $T$ is indistinguishable from an isotropic Gaussian (Supplementary Fig. 5).

Figure 5: Distribution of $x^{(T)}$ after centering and scaling $x^{(0)}$ to nanometers.
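The EGCL update and the sinusoidal edge features above can be illustrated with a short NumPy sketch. This is not the authors' implementation (the paper states the models were written in PyTorch); the helper names `mlp`, `edge_features`, and `egcl`, the hidden width of 16, and the use of tiny random tanh networks in place of the trained $\phi_e$, $\phi_h$, $\phi_x$ are all assumptions made for illustration.

```python
import numpy as np

def phi(x, k, N, D):
    """Sinusoidal feature phi(x, k), k = 1..D, following the piecewise definition above."""
    if k % 2 == 0:
        return np.sin(x * np.pi / N ** (2.0 * k / D))
    return np.cos(x * np.pi / N ** (2.0 * (k - 1) / D))

def edge_features(N, D):
    """a_{nn'}: encoding of the relative offset n - n' for every pair, shape (N, N, D)."""
    idx = np.arange(1, N + 1)
    offsets = idx[:, None] - idx[None, :]
    return np.stack([phi(offsets, k, N, D) for k in range(1, D + 1)], axis=-1)

def mlp(rng, d_in, d_out, hidden=16):
    """Hypothetical stand-in for a trained fully connected network."""
    W1 = rng.standard_normal((d_in, hidden)) / np.sqrt(d_in)
    W2 = rng.standard_normal((hidden, d_out)) / np.sqrt(hidden)
    return lambda z: np.tanh(z @ W1) @ W2

def egcl(x, h, a, phi_e, phi_h, phi_x, gamma=1e-8):
    """One EGCL update. x: (N, 3) coordinates, h: (N, D) node features, a: (N, N, D)."""
    N, D = h.shape
    diff = x[:, None, :] - x[None, :, :]                 # x_n - x_n', shape (N, N, 3)
    d = np.sum(diff ** 2, axis=-1)                       # squared distances d_{nn'}
    omega = diff / (np.sqrt(d) + gamma)[..., None]       # normalized direction vectors
    h_n = np.broadcast_to(h[:, None, :], (N, N, D))
    h_np = np.broadcast_to(h[None, :, :], (N, N, D))
    z = np.concatenate([h_n, h_np, d[..., None], a], axis=-1)  # (N, N, 3D + 1)
    mask = (1.0 - np.eye(N))[..., None]                  # drop the n' = n terms
    m = np.sum(mask * phi_e(z), axis=1)                  # messages m_n, shape (N, D)
    x_new = x + np.sum(mask * omega * phi_x(z), axis=1)  # coordinate update
    h_new = phi_h(np.concatenate([h, m], axis=-1))       # feature update
    return x_new, h_new
```

Because the networks only see squared distances and features, applying any orthogonal transformation (including a reflection) and translation to the input coordinates transforms the output coordinates in the same way and leaves the features unchanged, which is the reflection equivariance discussed in the limitations.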

PROOFS

We here provide additional details related to SMCDiff and the replacement method described in Section 4. Details of the replacement method (Song et al., 2021) and our analysis of its error are in

Algorithm 2 Replacement method for approximate conditional sampling
  Input: $x_M^{(0)}$ (motif)
  // Forward diffuse motif
  $x_M^{(1:T)} \sim q(x_M^{(1:T)} \mid x_M^{(0)})$
  // Reverse diffuse scaffold
  $x^{(T)} \sim p_\theta(x^{(T)})$
  for $t = T, \dots, 1$ do
    // Replace with forward diffused motif
    $x^{(t)} \leftarrow [x_M^{(t)}, x_S^{(t)}]$
    // Propose next step
    $x^{(t-1)} \sim p_\theta(x^{(t-1)} \mid x^{(t)})$
  end for
  Return $x_S^{(0)}, x^{(1:T)}$

Algorithm 3 Residual Resample
  Input: $w_{1:K}$ (weights), $x_{1:K}$ (particles)
  $\forall k, \ (c_k, r_k) \leftarrow (\lfloor K w_k \rfloor, \ K w_k - \lfloor K w_k \rfloor)$
  $x_C = [x_1, \dots, x_1, \dots, x_K, \dots, x_K]$, with each $x_k$ repeated $c_k$ times
  $R \leftarrow K - \sum_{k=1}^{K} c_k$
  $[i_1, \dots, i_R] \sim \mathrm{Multinomial}(r_{1:K}, R)$
  $x_R \leftarrow [x_{i_1}, \dots, x_{i_R}]$
  $x = \mathrm{concat}(x_R,$ […]

[…] $P_K^{(1)}(\cdot)$ is an approximation to $p_{S,1}(\cdot \mid x_M^{(0:T)})$. Proving the proposition amounts to showing that in the limit as $K$ goes to infinity, each $P_K^{(1)}(\cdot)$ converges weakly to $p_{S,1}(\cdot \mid x_M^{(0:T)})$, which by assumption is equal to $q_{S,1}(\cdot \mid x_M^{(0:T)})$. This weak convergence follows from standard asymptotics for particle filters (Chopin & Papaspiliopoulos, 2020, Proposition 11.4), which we make explicit in Lemma D.1. As a result, if we perform step (3) with $x_S^{(1)} \sim P_K^{(1)}(\cdot)$, then this lemma implies that $x_S^{(0)}$ converges in distribution to $q_{S,0}(x_S^{(0)} \mid x_M^{(0)})$, since (i) $q_{S,0}(x_S^{(0)} \mid x_M^{(1)}, x_S^{(1)})$ is continuous in $x_S^{(1)}$ and (ii) $x_S^{(0)}$ is independent of $x_M^{(0)}$ conditional on $x^{(1)}$.

Recall that to show the proposition, it was sufficient to show that $P_K^{(1)}$ converged weakly to $q_{S,1}(\cdot \mid x_M^{(0:T)})$; this implied that the $K$ particles returned by Algorithm 1 would then converge in distribution to $q_{S,0}(\cdot \mid x_M^{(0:T)})$ which, by the law of total probability, implied that they marginally converge to $q_{S,0}(\cdot \mid x_M^{(0)})$.
However, while the particles returned by Algorithm 1 may be treated as exchangeable, they are not independent, because they depend on shared randomness in $x_M^{(1:T)}$. To obtain approximate samples that are independent, it is necessary to run Algorithm 1 multiple times.

Residual resampling. Line 14 of Algorithm 1 indicates a Resample step. In particle filtering, resampling steps (or branching mechanisms (Doucet et al., 2001, Chapter 2)) filter out particles with very small weights, and replace them with additional copies of particles with large weights. Notably, the resampling step is the only point of departure of Algorithm 1 from the replacement method; without resampling, the algorithms behave identically. While a variety of possible branching mechanisms exist, we use residual resampling (Algorithm 3) in our implementation for its simplicity.

[…] $P_K^{(1)}(A) = \int_A p_{S,1}(x \mid x_M^{(0:T)})\,dx.$

Proof. The proof of the lemma follows from an application of standard asymptotics for particle filtering (Chopin & Papaspiliopoulos, 2020, Proposition 11.4). In particular, to apply Proposition 11.4 we use the formalism of Feynman-Kac (FK) models, following the notation of Chopin & Papaspiliopoulos (2020, Chapter 5). Though typically (and in Chopin & Papaspiliopoulos (2020)) FK models are defined via a sequence of approximations at increasing time steps, we consider decreasing time steps because we are approximating the reverse time process. We take the initial distribution as $M_T(x_S^{(T)}) = p_{S,T}(x_S^{(T)})$, the transition kernel as $M_t(x_S^{(t+1)}, x_S^{(t)}) = p_{S,t}(x_S^{(t)} \mid x^{(t+1)})$, and […]

[…] and observe that this lower bound is monotonically decreasing in $\sigma_1^2$ for $\sigma_1^2 \le \sigma_2^2$.
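Therefore

$$\begin{aligned}
&\mathrm{EKL}\left[q_{S,t}(\cdot \mid x_S^{(t+1)}, x_M^{(0)}) \,\|\, q_{S,t}(\cdot \mid x_S^{(t+1)})\right] \\
&\quad= \int q_{M,0}(x_M^{(0)})\, q_{S,t+1}(x_S^{(t+1)} \mid x_M^{(0)})\, \mathrm{KL}\left[q_{S,t}(\cdot \mid x_S^{(t+1)}, x_M^{(0)}) \,\|\, q_{S,t}(\cdot \mid x_S^{(t+1)})\right] dx_M^{(0)}\, dx_S^{(t+1)} \\
&\quad\ge \int q_{M,0}(x_M^{(0)})\, q_{S,t+1}(x_S^{(t+1)} \mid x_M^{(0)})\, \mathrm{KL}\left[\mathcal{N}\big(0, \mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}, x_M^{(0)}]\big) \,\|\, \mathcal{N}\big(0, \mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}]\big)\right] dx_M^{(0)}\, dx_S^{(t+1)} \\
&\quad\ge \mathrm{KL}\left[\mathcal{N}\big(0, \beta(1 - \beta\rho^2\alpha)\big) \,\|\, \mathcal{N}(0, \beta)\right] \\
&\quad\ge \frac{1}{2}\left[\log\frac{\beta}{\beta(1 - \beta\rho^2\alpha)} + \frac{\beta(1 - \beta\rho^2\alpha)}{\beta} - 1\right] \\
&\quad= -\frac{1}{2}\left[\log(1 - \beta\rho^2\alpha) + \beta\rho^2\alpha\right],
\end{aligned}$$

where the second inequality follows from Lemma D.2 and the monotonicity of the KL in $\sigma_1^2$.

Proof of Lemma D.2:

Proof. That $\mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}] = \beta$ follows immediately from the fact that $[x_S^{(t)}, x_S^{(t+1)}]$ is marginally bivariate normal distributed with covariance $\sqrt{1 - \beta}$. The upper bound on $\mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}, x_M^{(0)}]$ is trickier. Observe that $[x_S^{(t)}, x_S^{(t+1)}] \mid x_M^{(0)}$ is bivariate Gaussian and that

$$\mathrm{Var}\left[\begin{pmatrix} x_S^{(t)} \\ x_S^{(t+1)} \end{pmatrix} \,\middle|\, x_M^{(0)}\right] = \begin{pmatrix} 1 - \rho^2\alpha & \sqrt{1-\beta}\,(1 - \rho^2\alpha) \\ \sqrt{1-\beta}\,(1 - \rho^2\alpha) & 1 + \beta\rho^2\alpha - \rho^2\alpha \end{pmatrix}.$$

As such, the conditional variance may be computed in closed form as

$$\mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}, x_M^{(0)}] = \beta(1 - \rho^2\alpha) + (1 - \beta)(1 - \rho^2\alpha)\left[1 - \frac{1 - \rho^2\alpha}{1 - \rho^2\alpha + \beta\rho^2\alpha}\right].$$

Since $(1 - \rho^2\alpha)/(1 - \rho^2\alpha + \beta\rho^2\alpha) \ge 1 - (\beta\rho^2\alpha)/(1 - \rho^2\alpha)$, we have $1 - (1 - \rho^2\alpha)/(1 - \rho^2\alpha + \beta\rho^2\alpha) \le (\beta\rho^2\alpha)/(1 - \rho^2\alpha)$, and we can write

$$\begin{aligned}
\mathrm{Var}[x_S^{(t)} \mid x_S^{(t+1)}, x_M^{(0)}] &= \beta(1 - \rho^2\alpha) + (1 - \beta)(1 - \rho^2\alpha)\left[1 - \frac{1 - \rho^2\alpha}{1 - \rho^2\alpha + \beta\rho^2\alpha}\right] \\
&\le \beta(1 - \rho^2\alpha) + (1 - \beta)(1 - \rho^2\alpha)\,\frac{\beta\rho^2\alpha}{1 - \rho^2\alpha} \\
&= \beta(1 - \rho^2\alpha) + (1 - \beta)\beta\rho^2\alpha \\
&= \beta(1 - \beta\rho^2\alpha).
\end{aligned}$$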
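As a numerical sanity check on Lemma D.2 and the resulting lower bound, the following sketch evaluates the closed-form conditional variance from the covariance matrix above and compares it against the bound $\beta(1 - \beta\rho^2\alpha)$. The function names are ours and purely illustrative; the scalar Gaussian setting and the symbols $\rho$, $\alpha$, $\beta$ follow the proof.

```python
import numpy as np

def cond_var_given_next_and_motif(rho, alpha, beta):
    """Var[x_S^(t) | x_S^(t+1), x_M^(0)] via the Schur complement of the 2x2 covariance."""
    c = rho ** 2 * alpha
    var_t = 1.0 - c                           # Var[x_S^(t) | x_M^(0)]
    var_next = 1.0 - c + beta * c             # Var[x_S^(t+1) | x_M^(0)]
    cov = np.sqrt(1.0 - beta) * (1.0 - c)     # Cov[x_S^(t), x_S^(t+1) | x_M^(0)]
    return var_t - cov ** 2 / var_next

def kl_zero_mean_gauss(var1, var2):
    """KL[N(0, var1) || N(0, var2)] for scalar Gaussians."""
    return 0.5 * (np.log(var2 / var1) + var1 / var2 - 1.0)

def corollary_lower_bound(rho, alpha, beta):
    """Final expression of the derivation: -(1/2) [log(1 - beta rho^2 alpha) + beta rho^2 alpha]."""
    c = beta * rho ** 2 * alpha
    return -0.5 * (np.log(1.0 - c) + c)
```

For example, with $\rho = 0.9$, $\alpha = 0.9$, $\beta = 0.1$ the exact conditional variance is about 0.079, below the bound $\beta(1 - \beta\rho^2\alpha) \approx 0.093$, so the KL to $\mathcal{N}(0, \beta)$ is at least the stated lower bound.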

E DETECTING CHIRALITY

Section 6 noted the limitation of ProtDiff that it can generate left-handed helices (which do not stably occur in natural proteins). Figure 6 presents two such examples. We additionally note that, as in Figure 6 Left, model samples can include multiple helices with differing chirality.

Training was performed using the Adam optimizer with hyperparameters learning_rate=1e-4, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. We trained for 1,000,000 steps using batch size 16. We used a single Nvidia A100 GPU for approximately 24 hours. We implemented all models in PyTorch. We used the same linear noise schedule as Ho et al. (2020) where $\beta_0 = 0.0001$, $\beta_T = 0.02$, and $T = 1024$. We did not perform hyperparameter tuning.
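The linear noise schedule reported above can be written out in a few lines. The sketch below uses NumPy rather than PyTorch, and it assumes the schedule consists of $T$ evenly spaced values with both endpoints included and the standard DDPM forward marginal of Ho et al. (2020); the names `betas`, `alpha_bar`, and `q_sample` are ours.

```python
import numpy as np

T = 1024
# Linear schedule from 1e-4 to 0.02 (assumed: T evenly spaced values, endpoints inclusive).
betas = np.linspace(1e-4, 0.02, T)
# Cumulative signal retention: alpha_bar[t-1] = prod_{s<=t} (1 - beta_s).
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Draw x^(t) ~ q(x^(t) | x^(0)) for t in 1..T under the DDPM forward process."""
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)
```

Under these assumptions the final `alpha_bar` value is far below 1e-4, so almost no signal from $x^{(0)}$ survives at step $T$.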

G ADDITIONAL METRIC DETAILS

Self-consistency algorithm. Section 5.1 described our self-consistency metrics for evaluating the designability of backbones generated with ProtDiff. Algorithm 4 makes explicit the procedure we use for computing these metrics.

Algorithm 4 Self-consistency calculation
  Input: $x \in \mathbb{R}^{N \times 3}$
  for $i \in 1, \dots, 8$ do
    $s_i \leftarrow \mathrm{ProteinMPNN}(x)$
    $x_i \leftarrow \mathrm{AF2}(s_i)$
  end for
  sc_tm $\leftarrow \max_{i \in 1, \dots, 8} \mathrm{TMscore}(x_i, x)$
  Output: sc_tm

Using dihedral angles to calculate helix chirality. Natural proteins are chiral molecules that contain only right-handed alpha helices. However, because the underlying EGNN in our model is equivariant to reflection, it can produce samples with left-handed helices. While examining model samples, we additionally observed samples with both left- and right-handed helices (Figure 6), even though in theory the EGNN should be able to detect and avoid the chiral mismatch. Left-handed helices are fundamentally invalid geometries in proteins and represent a trivial failure mode when calculating the self-consistency and other metrics. Samples with a mixture of left and right-

attempted to scaffold this motif into a 62 residue protein, with the motif as residues 42-62. We chose this placement because previous work (Wang et al., 2022) identified a promising candidate scaffold with this motif placement. In contrast to the cases described in the main text, for which a suitable scaffold exists in the training set, SMCDiff and the other inpainting methods failed to identify scaffolds that recapitulated this motif to within a motif RMSD of 1 Å.
Figure 9 : Additional inpainting results on a more challenging motif extracted from the respiratory syncytial virus (RSV) and EH-hand motif. The three inpainting methods are evaluated as described in Section 5.

H.2 QUALITATIVE ANALYSIS OF SCTM IN DIFFERENT RANGES

In this section, we give intuition for backbone designs and AF2 predictions associated with different values of scTM to aid the interpretation of the scTM results provided in Section 5. Figure 10 examines a possible categorization of scTM in three ranges. The first two rows correspond to backbone designs that achieve scTM > 0.9. We see the backbone designs in the first column closely match the AF2 prediction in the second column. A closely related PDB example can be found by searching for the PDB chain with the highest TM-score to the AF2 prediction. We showed in Figure 3B that scTM > 0.9 is indicative of a close structural match being found in PDB. The middle two rows correspond to designs that achieve scTM ∼ 0.5. These are examples of backbone designs on the edge of what we deemed as designable (scTM > 0.5). In these cases, the AF2 prediction shares the same coarse shape as the backbone design but possibly with different secondary-structure ordering and composition. In the length 69 example, we see the closest PDB chain has a TM-score of only 0.65 to the AF2 prediction but roughly the same secondary-structure ordering as the backbone design. The length 100 sample is a similar case of AF2 producing a roughly similar shape to the backbone design, but has no matching monomer in PDB. The final category of scTM < 0.25 reflects failure cases when scTM is low. The AF2 predictions in this case have many disordered regions and bear little structural similarity with the original backbone design. Similar PDB chains are not found. We expect that improved generative models of protein backbones would not produce any samples in this category.

I APPLICABILITY OF SMCDIFF BEYOND PROTEINS: MNIST INPAINTING

Our goal in this section is to study the applicability of SMCDiff beyond motif-scaffolding, by applying it to inpainting on the MNIST digits dataset. We compare SMCDiff with the replacement method on the task of sampling the remaining half of MNIST digits. We first train DDPM with $\beta_1 = 10^{-4}$, $\beta_T = 0.2$, $T = 1000$ using a small 8-layer CNN on MNIST with batch size 128 and the Adam optimizer for 100 epochs until it is able to generate reasonable MNIST samples (Figure 14). We then selected 3 random MNIST images and occluded the right half. The left half then serves as the conditioning information to the diffusion model (Figure 15). For each occluded image, we fixed a single forward trajectory and sampled 16 images from each method: the replacement method and SMCDiff with 16 or 64 particles (K). Results are shown in Fig. 16. We observe the replacement method can sometimes produce coherent samples as a continuation of the conditioning information, but more often it produces incoherent digits. SMCDiff, on the other hand, tends to produce digits that complement the conditioning information. For more difficult occlusions, such as 5 and 0, SMCDiff can still fail, although increasing the number of particles (K = 64) tends to produce samples that are more visually coherent. It is important to note that SMCDiff has additional computational overhead that grows with the number of particles. It can be more expensive than the replacement method but can yield higher-quality samples. Investigating SMCDiff on more difficult datasets with improved architectures is a direction of future research.

Figure 16: MNIST inpainting results for replacement and SMCDiff. See text for explanation.



Biennial protein folding competition where AF2 achieved first place. Weights available under the Apache License 2.0.



Figure 2: Motif-scaffolding case studies. (A) Example of two scaffold structures generated around a segment of 5trv. Orange: desired input motif, Grey: AlphaFold-predicted structure of two scaffolds, with the motif highlighted (purple). Both scaffolds were sampled using SMCDiff with scTM > 0.5. (B,C) Motif RMSD for 5trv and 6exz test cases, its dependence on scaffold size, and comparison of SMCDiff to two naive inpainting methods (fixed, replacement).

Figure 3: Protein backbone samples from ProtDiff. (A) Density plot of scTM for different length categories (50-70, 70-128). The dashed line at scTM = 0.5 indicates the threshold of "designability", points to the right are considered "designable" (see text). (B) Scatter plot of scTM and the highest TM-score of each sample to all of PDB. Points represented as a grey "×" are detected to contain an (invalid) left-handed helix. Dashed lines indicate thresholds scTM = 0.5. (C) Example of a designable backbone sample (rainbow) with scTM > 0.5 (boxed in red in panel B) to its closest PDB example (6c59, grey) with a TM-score of 0.54.

Figure 4: Interpolations between ProtDiff samples demonstrating the diversity of backbones captured. Top: 64-residue example. Bottom: 56-residue example. ProtDiff samples are determined by the Gaussian noise across all steps, ϵ (0:T ) .

) converges weakly to $p_{S,1}(\cdot \mid x_M^{(0:T)})$, which by assumption is equal to $q_{S,1}(\cdot \mid x_M^{(0:T)}$

K are as constructed in Algorithm 1. Assume the conditions of Proposition 4.1. Then $P_K^{(1)}$ converges weakly to $p_{S,1}(\cdot \mid x_M^{(0:T)})$ as $K$ goes to infinity. That is, for any Borel measurable $A$, $\lim_{K \to \infty} P_K^{(1)}$

Figure 6: Two examples of protein backbone samples with incorrect left handed helices.

Figure 8: Failure modes in ProtDiff backbone samples. (A) Backbone clashes and chain breaks. The C-α atoms can be spaced further than the typical 3.8 Å between neighbors, resulting in a chain break (dashed lines). Additionally, backbone segments can be too close to each other, resulting in obvious overlaps and clashes. (B) Backbones with a mixture of left (circled in red) and right (circled in green) handed helices. These chirality errors cannot be corrected simply by mirroring the sampled backbone.

Figure 10: Qualitative analysis of unconditional backbone samples from ProtDiff. The first column displays backbone designs from ProtDiff and their sequence lengths. The second column displays the highest scTM scoring AF2 predictions from the ProteinMPNN sequences of the corresponding backbone design in the first column. The third column displays the closest PDB chain to the AF2 prediction in the second column with the PDB ID and TM-score written below. The third column is blank for the last two rows since no PDB match could be found. See Appendix H.2 for discussion.

Figure 13: Clustering of self-consistent ProtDiff samples. The distance matrix is 1 - TM-score between pairs of samples, and ranges from 0 (exact match) to 1 (no match). Dendrograms are from hierarchical clustering using the average distance metric. Designs on the right are cluster centroids. Gray lines connect larger clusters with more than one member to their centroids, while the remaining designs are from a random selection of the remaining single-sample clusters. Protein backbones are colored from blue at the N-terminus to red at the C-terminus.

Figure 14: Unconditional MNIST samples.

Figure 15: Full MNIST images and their occluded halves used for inpainting experiments.

The replacement method and its error
D.2 SMCDiff details and verification proof of Proposition 4.1
D.3 Proofs and lemmas
Additional motif-scaffolding results
H.2 Qualitative analysis of scTM in different ranges
H.3 Additional latent interpolation results
H.4 Structural clustering

Motif-scaffolding test case additional details.

ACKNOWLEDGEMENTS

The authors thank Octavian-Eugen Ganea, Hannes Stärk, Wenxian Shi, Felix Faltings, Jeremy Wohlwend, Nitan Shalon, Gabriele Corso, Sean Murphy, Wengong Jin, Bowen Jing, Renato Berlinghieri, John Yang, and Jue Wang for helpful discussion and feedback. We thank Justas Dauparas for access to an early version of ProteinMPNN. We dedicate this work in memory of Octavian-Eugen Ganea who initiated the project by connecting all the authors. BLT and JY were supported in part by an NSF-GRFP. BLT and TB were supported in part by NSF grant 2029016 and an ONR Early Career Grant. JY, RB, and TJ acknowledge support from NSF Expeditions grant (award 1918839: Collaborative Research: Understanding the World Through Code), Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) threats program, the DARPA Accelerated Molecular Discovery program and the Sanofi Computational Antibody Design grant. DT and DB were supported with funds provided by a gift from Microsoft. DB was additionally supported by the Audacious Project at the Institute for Protein Design, the Open Philanthropy Project Improving Protein Design Fund, an Alfred P. Sloan Foundation Matter-to-Life Program Grant (G-2021-16899) and the Howard Hughes Medical Institute.


Appendix D.1. Appendix D.2 provides details of our sampling method, SMCDiff, including (1) a proof of Proposition 4.1 and (2) details of the residual resampling step. We leave technical proofs and lemmas to Appendix D.3.

Notation. In the following, we require notation that is more precise than in previous sections. For each $t = 0, \dots, T$, we let $q_t(\cdot)$ and $p_t(\cdot)$ denote the density functions of $x^{(t)}$ according to the forward process and to our neural network approximation of the reverse process, respectively. We denote densities restricted to the motif and scaffold with subscripts $M$ and $S$. For example, we here write $p_{M,t}(x_M^{(t)})$, whereas we wrote $p_\theta(x_M^{(t)})$ in the main text. We write (random) conditional densities as $q_{M,t}(\cdot \mid x_M^{(t-1)})$ and write the (deterministic) conditional density for an observation $x$ […]

An object of interest will be the Kullback-Leibler (KL) divergence. We write $\mathrm{KL}[q_t(\cdot) \,\|\, p_t(\cdot)] := \int q_t(x) \log \frac{q_t(x)}{p_t(x)}\,dx$, where $\log(\cdot)$ is the natural (base $e$) logarithm. We will also encounter the expected KL between conditional densities, which we will write as […] where the outer expectation is taken with respect to the unconditional density associated with the first argument of $\mathrm{EKL}[\cdot \,\|\, \cdot]$.

D.1 THE REPLACEMENT METHOD AND ITS ERROR

The replacement method was proposed by Song et al. (2021) for the task of inpainting in the context of score-based generative models. Work (Ho et al., 2022) concurrent with the present paper applied the replacement method to DPMs. Although Song et al. (2021) note that this approach can be understood as approximate conditional sampling, they provide no discussion of approximation error. We here show that the replacement method introduces irreducible error that is inherent to the forward process. Algorithm 2 provides an explicit description of the replacement method.

The first return of Algorithm 2, $x_S^{(0)}$, is used as a putative inpainting solution or approximate conditional sample. But Algorithm 2 additionally returns subsequent time steps, $x^{(1:T)}$. We denote the approximation over all steps implied by the generative procedure in Algorithm 2 by $p^{\mathrm{Repl}}_{1:T}(\cdot \mid x_M^{(0)} = x_M)$ and compare it to the exact conditional, $q_{1:T}(\cdot \mid x_M^{(0)} = x_M)$. We here consider error in KL divergence because it permits an analytically tractable and transparent analysis. We additionally consider the idealized scenario where $p_{0:T}(\cdot)$ perfectly captures the reverse process. Under this condition, the forward KL takes a surprisingly simple form.

Proposition D.1. Suppose that $p_{0:T}(\cdot)$ exactly matches the forward diffusion process such that for every $x$, $p_t($ […] (2)

Proposition D.1 reveals that the replacement method introduces approximation error that is intrinsic to the forward process and cannot be eliminated by making $p_{0:T}(\cdot)$ more expressive. Although the individual terms in the right hand side of Equation (2) are not analytically tractable in general, in the following corollary we show that this approximation error can be non-trivial by considering a special case. For this example, we depart from the earlier assumption that $x$ is in 3D, and consider scalar valued $x_M$ and $x_S$.

Corollary D.2. Suppose that $[x_M^{(0)}, x_S^{(0)}]$ is bivariate normal distributed with mean zero, unit variance, and covariance $\rho$. Further suppose that $q_{S,t}$ […] $, \beta^{(t+1)})$ as in Section 2, where $\beta^{(t+1)}$ and $\bar{\alpha}^{(t)}$ are between 0 and 1. Then […]

We note two takeaways of Corollary D.2. First, as we might intuitively expect, this error can be large when significant correlation in the target distribution is present. Second, we see that the approximation error can be larger at earlier time steps, when $\bar{\alpha}^{(t)}$ is closer to 1.
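The replacement method analyzed in this section (Algorithm 2) can be sketched generically as below. The forward kernel $x^{(t)} = \sqrt{1 - \beta_t}\,x^{(t-1)} + \sqrt{\beta_t}\,\epsilon$ is the standard DDPM choice and is assumed here; `reverse_step` is a hypothetical callable standing in for a draw from $p_\theta(x^{(t-1)} \mid x^{(t)})$, which a trained model would supply.

```python
import numpy as np

def replacement_method(x0_motif, motif_idx, n_res, betas, reverse_step, rng):
    """Approximate conditional sampling by overwriting motif coordinates at every step.

    x0_motif: (len(motif_idx), 3) motif coordinates; betas: length-T noise schedule;
    reverse_step(x, t, rng) -> sample of x^(t-1) given x^(t) (assumed interface).
    """
    T = len(betas)
    # Forward diffuse the motif once: xm[t] ~ q(x_M^(t) | x_M^(t-1)), with xm[0] = x_M^(0).
    xm = [np.asarray(x0_motif, dtype=float)]
    for t in range(1, T + 1):
        b = betas[t - 1]
        noise = rng.standard_normal(xm[0].shape)
        xm.append(np.sqrt(1.0 - b) * xm[-1] + np.sqrt(b) * noise)
    # Reverse diffuse, starting from x^(T) ~ N(0, I).
    x = rng.standard_normal((n_res, 3))
    trajectory = {}
    for t in range(T, 0, -1):
        x[motif_idx] = xm[t]             # replace with the forward-diffused motif
        trajectory[t] = x.copy()         # record x^(t)
        x = reverse_step(x, t, rng)      # propose x^(t-1)
    scaffold_idx = np.setdiff1d(np.arange(n_res), motif_idx)
    return x[scaffold_idx], trajectory   # x_S^(0) and x^(1:T)
```

Note that the scaffold never sees the motif through a conditional; the coupling enters only through the overwritten coordinates passed to `reverse_step`, which is the source of the error quantified by Proposition D.1.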

D.2 SMCDIFF DETAILS AND VERIFICATION PROOF OF PROPOSITION 4.1

The idea behind the SMCDiff procedure in Algorithm 1 is to break sampling of $x^{(0)}$ […] into three stages: […]

If all three steps were performed exactly, by the law of total probability $x_S^{(0)}$ in step (3) would (marginally) be an exact sample from $q_{S,0}(\cdot \mid x_M^{(0)})$. As such, SMCDiff aims to perform step (1) and approximate steps (2) and (3).

Step (1) corresponds to forward diffusing the motif in lines 2-3 and is exact because we diffuse according to $q$.

Step (3) corresponds to line 17 in the last iteration (when $t = 1$). Specifically, to sample from $q_{S,0}$ […] we make three observations. (i) The Markov structure of the forward process implies that $q_{S,0}$ […]. (ii) By the assumption that the forward and approximated reverse process agree, we have $q_{S,0}(\cdot \mid x$ […]. As a result, under the assumptions of the proposition, we may sample from $q_{S,0}(\cdot \mid x$ […]$)$, and perform step (3) exactly as well.

Step (2) is the only non-trivial step, and cannot be performed exactly. The challenge is that although the reverse process approximation, $p_{S,1:T}(\cdot \mid x_M^{(0:T)})$, is well-defined, computing it explicitly involves an intractable, high-dimensional integral.

The sequential Monte Carlo approach of SMCDiff, then, is to circumvent this intractability by constructing a sequence of approximations. For each $t = T, T-1, \dots, 1$, we approximate $p_{S,t}(\cdot \mid x_M^{(t-1:T)})$ (and thereby $q_{S,t}(\cdot \mid x_M^{(t-1:T)})$) with $K$ weighted atoms (the particles). […] We denote the potential functions as $G_t(x)$. The sequence of FK models, $Q_t$, then corresponds to […] for each $t$, where $L_t$ is a normalizing constant.

By substituting in our choices of $M_t$ and $G_t$, we can rewrite and simplify $Q_t$ as […] where lines 3 and 4 drop multiplicative constants that do not depend on $x_S^{(t:T)}$. From the above derivation, we see that each $Q_t(x)$, and in particular that $Q_1(x$ […] As such, the desired convergence in the statement of the lemma is equivalent to that $P^{(1)}$ […] Chopin & Papaspiliopoulos (2020, Proposition 11.4) provide this result for the generic particle filtering algorithm (see Chopin & Papaspiliopoulos (2020, Algorithm 10.1), which is written in the FK model form described above). More specifically, Proposition 11.4 proves almost sure convergence of all Borel measurable functions of $P_K^{(t)}$, which implies the desired weak convergence. Although the proof provided in Chopin & Papaspiliopoulos (2020) is restricted to the simpler, but higher variance, case where the resampling step uses multinomial resampling, the authors note that Chopin (2004) proves it holds in the case of residual resampling (which we use in our experiments) as well.
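The residual resampling step used by SMCDiff (Algorithm 3) can be sketched as follows. Each particle is first copied $\lfloor K w_k \rfloor$ times deterministically, and the remaining $R$ slots are filled by sampling indices with probability proportional to the residuals $r_k$. The function name and the NumPy `Generator.choice` call (used here in place of an explicit multinomial draw) are our choices, not details from the paper.

```python
import numpy as np

def residual_resample(weights, particles, rng):
    """Residual resampling: deterministic copies plus a random draw over the residuals."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                # normalize defensively
    particles = np.asarray(particles)
    K = len(w)
    counts = np.floor(K * w).astype(int)           # c_k: guaranteed copies
    residuals = K * w - counts                     # r_k: leftover mass
    R = K - counts.sum()                           # number of slots still open
    idx = np.repeat(np.arange(K), counts)          # indices of the deterministic copies
    if R > 0:
        extra = rng.choice(K, size=R, p=residuals / residuals.sum())
        idx = np.concatenate([idx, extra])
    return particles[idx]
```

Compared with plain multinomial resampling, only the residual part is random, so any particle with weight at least $1/K$ is guaranteed to survive; this is the lower-variance behavior referred to above.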

Replacement method error -lemmas and proofs

We here provide proofs of Proposition D.1 and Corollary D.2.

Proof of Proposition D.1:

Proof. The result obtains from recognizing where the replacement method approximation agrees with the forward process, using conditional independences in both processes, and applying the chain rule for KL divergences. We make this explicit in the derivation below, with comments explaining the transition to the following line.

[…] // By the chain rule of probability.
$= q_{1:T}(x$ […] $_M)$ // By the agreement of $q$ and $p^{\mathrm{Repl}}$ on the motif, and the chain rule of probability.
[…] and the assumption that $p_\theta$ matches $q$.
$= q_{1:T}(x$ […]

Proof of Corollary D.2:

The proof of the corollary relies on a lemma on the variances of the two relevant conditional distributions. We state this lemma, whose proof is at the end of the section, before continuing. For notational simplicity, we drop the scripts and annotations on ᾱ(t) and β (t+1) , and instead write α and β, respectively. Now we provide a proof of Corollary D.2.

Proof. First recall that

Backbones that mix left- and right-handed helices are especially problematic because they cannot be corrected simply by reflecting the coordinates. As such, it is important to identify and separate samples with mixed chirality.

To detect chirality, we compute the dihedral angle between four consecutive C-α atoms as a chiral metric to distinguish between the two helix chiralities. Algorithmically, for every C-α i, we calculate the dihedral between C-α i, i+1, i+2, and i+3. C-α atoms i with dihedral angles between 0.6 and 1.2 radians are classified as right-handed helices, and those with angles between -1.2 and -0.6 radians are classified as left-handed helices, with everything else classified as non-helical. Because C-α atoms in native helices tend to form contiguous stretches longer than one residue in the primary sequence, helical stretches less than one amino acid were removed. This filtering is meant to help avoid accidentally counting the occasional isolated backbone geometry that falls into a helical bin as a true helix. Finally, for all C-α atoms i that are still categorized as part of a helix, the associated i+1, i+2, and i+3 C-α atoms are also counted as part of that helix.
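The chirality check described above can be sketched as follows. This is an illustrative reading of the procedure rather than the code used for the paper: the function names, the `min_run` parameter (interpreting the filtering step as dropping isolated hits shorter than two consecutive residues), and the ideal-helix geometry in the usage example (radius 2.3 Å, rise 1.5 Å, 100° per residue) are all assumptions introduced here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ca_dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)


def helix_labels(ca, lo=0.6, hi=1.2, min_run=2):
    """Per-residue labels: +1 right-handed, -1 left-handed, 0 non-helical.

    `min_run` is an assumed reading of the filtering step: runs of
    same-handed hits shorter than `min_run` are discarded.
    """
    n = len(ca)
    raw = []
    for i in range(n - 3):
        phi = ca_dihedral(ca[i], ca[i + 1], ca[i + 2], ca[i + 3])
        raw.append(1 if lo <= phi <= hi else -1 if -hi <= phi <= -lo else 0)
    # Remove isolated helical hits.
    kept, start = list(raw), 0
    while start < len(raw):
        end = start
        while end < len(raw) and raw[end] == raw[start]:
            end += 1
        if raw[start] != 0 and end - start < min_run:
            kept[start:end] = [0] * (end - start)
        start = end
    # Extend each surviving hit at i to residues i+1, i+2, i+3.
    labels = [0] * n
    for i, lab in enumerate(kept):
        if lab != 0:
            labels[i:i + 4] = [lab] * 4
    return labels


def has_mixed_chirality(labels):
    return 1 in labels and -1 in labels


def ideal_helix(n, mirror=False):
    """Approximate C-alpha trace of an ideal alpha helix (assumed geometry)."""
    sign = -1.0 if mirror else 1.0
    step = math.radians(100.0)
    return [(sign * 2.3 * math.cos(i * step), 2.3 * math.sin(i * step), 1.5 * i)
            for i in range(n)]


if __name__ == "__main__":
    print(helix_labels(ideal_helix(12)))
    print(helix_labels(ideal_helix(12, mirror=True)))
```

Where a left-handed hit and a right-handed hit lie within three residues of each other, the extension step in this sketch lets the later hit overwrite the earlier one; how such overlaps were resolved in the original analysis is not specified.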

H ADDITIONAL EXPERIMENTAL RESULTS

In this section, we describe additional results to complement the main text. We provide a description of the motif targets in Section 4, along with results of a scaffolding failure case in Appendix H.1. We here provide additional details of the motif-scaffolding experiments described in Section 5. Table 1 specifies the total lengths, motif sizes, and motif indices of our test cases. In Figure 7 we depict the structures of the native proteins (6exz and 5trv) from which the motifs examined quantitatively in the main text were extracted. Figure 8 analyzes commonly observed failure modes of ProtDiff backbone samples involving chain breaks, steric clashes, and incorrect chirality. Figure 9 presents quantitative results on a harder inpainting target. In this case, the motif is defined as residues 163-181 of chain A of respiratory syncytial virus (RSV) protein (PDB ID: 5tpn).

H.3 ADDITIONAL LATENT INTERPOLATION RESULTS

We here provide additional latent interpolations. Figures 11 and 12 depict interpolations between model samples for lengths 89 and 63, respectively.

