DIFFUSION PROBABILISTIC MODELING OF PROTEIN BACKBONES IN 3D FOR THE MOTIF-SCAFFOLDING PROBLEM

Abstract

Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.

1. INTRODUCTION

A central task in protein design is the creation of a stable scaffold to support a target motif. Here, motifs are structural protein fragments that impart biological function, while scaffolds stabilize the motif's structure. Vaccines and enzymes have already been designed by solving certain instances of this motif-scaffolding problem (Procko et al., 2014; Correia et al., 2014; Jiang et al., 2008; Siegel et al., 2010). However, past solutions to this problem have required substantial expert involvement and laborious trial and error.

Machine learning (ML) offers the hope of automating and better directing this search. But existing ML approaches face one of two major roadblocks. First, many methods do not build scaffolds longer than about 20 residues. For many motif sizes of interest, the resulting proteins would be smaller than the shortest commonly studied simple protein folds (35-40 residues) (Gelman & Gruebele, 2014). Second, while other methods may generate longer scaffolds using stochastic search techniques, they require hours of computation to generate a single plausible scaffold (Wang et al., 2022; Anishchenko et al., 2021; Tischer et al., 2020). Moreover, when a plausible scaffold is found, it remains to be experimentally validated. It is therefore desirable to return not just a single scaffold but a set of scaffolds exhibiting diverse sequences and structural variation, to increase the likelihood of success in practice.

In the present work, we demonstrate the promise of a particular generative modeling approach within ML for efficiently returning a diverse set of motif-supporting scaffolds. Generative models have been shown to capture a distribution over diverse protein structures (Lin et al., 2021). But it is not clear how to handle conditioning (on the motif) using these approaches.
Diffusion probabilistic models (DPMs) offer a potential alternative; not only do they provide a more straightforward path to handling conditioning, but they have also enjoyed success generating small molecules in 3D (Hoogeboom

