FOLD2SEQ: A JOINT SEQUENCE(1D)-FOLD(3D) EMBEDDING-BASED GENERATIVE MODEL FOR PROTEIN DESIGN

Abstract

Designing novel protein sequences consistent with a desired 3D structure or fold, often referred to as the inverse protein folding problem, is a central but nontrivial task in protein engineering. It has a wide range of applications in energy, biomedicine, and materials science. However, challenges exist due to the complex sequence-fold relationship and the difficulties associated with modeling 3D folds. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific fold. Our model learns a fold embedding from the density of the secondary structural elements in 3D voxels, and then models the complex sequence-structure relationship by learning a joint sequence-fold embedding. Experiments on a high-resolution, complete, single-structure test set demonstrate improved performance of Fold2Seq in terms of speed and reliability for sequence design, compared to existing baselines including the state-of-the-art RosettaDesign and other neural net-based approaches. The unique advantages of fold-based Fold2Seq become more evident on diverse real-world test sets comprising low-resolution, incomplete, or ensemble structures, in comparison to a structure-based model.

1. INTRODUCTION

Computational protein design is the conceptual inverse of the protein structure prediction problem: it aims to infer an amino acid sequence that will fold into a given 3D structure. Designing protein sequences that will fold into a desired structure has a broad range of applications, from therapeutics to materials (Kraemer-Pecore et al., 2001). Despite significant advancements in methodologies as well as in computing power, inverse protein design remains challenging, primarily due to the vast size of the sequence space and the difficulty of learning a function that maps from the 3D structure space to the sequence space. Earlier works rely mostly on energy minimization-based approaches (Koga et al., 2012; Rocklin et al., 2017; Huang et al., 2011), which follow a scoring function (force fields, statistical potentials, or machine learning (ML) models) and sample both sequence and conformational space. Such methods often suffer from drawbacks such as low accuracy of energy functions or force fields (Khan & Vihinen, 2010) and low efficiency in sequence and conformational search (Koga et al., 2012). Recently, as data on both protein sequences (hundreds of millions) and structures (a few hundred thousand) are quickly accumulating, data-driven approaches for inverse protein design are rapidly emerging (Greener et al., 2018; Karimi et al., 2020; Ingraham et al., 2019). Generally, data-driven protein design attempts to model the probability distribution over sequences conditioned on the structures: P(x|y), where x and y are protein sequences and structures, respectively. Two key challenges remain: (1) defining a good representation (y) of the protein structure and (2) modeling the sequence generation process conditioned on y. Current protein design methods use protein backbone information from a single protein structure (fixed backbone) or from a set of backbone structures consistent with a single fold (flexible backbone).
In earlier studies, protein structures are represented as a 1D string (Greener et al., 2018), a 1D vector (Karimi et al., 2020), a 2D image (Strokach et al., 2020), or a graph (Ingraham et al., 2019). The sequence generation methods used in these protein design studies can be classified as non-autoregressive (Karimi et al., 2020; Greener et al., 2018; Strokach et al., 2020) and autoregressive (Ingraham et al., 2019; Madani et al., 2020).
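
The distinction between the two families of generation methods can be made concrete by writing them as different factorizations of P(x|y). The following is a generic sketch, not the formulation of any one cited method; the sequence length n, the residue index i, and the optional latent variable z are notation assumed here for illustration only.

```latex
% Autoregressive: each residue is conditioned on the structure y
% and on all previously generated residues.
P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1}, y)

% Non-autoregressive (simplest form): residues are conditionally
% independent given the structure y and, if the model has one,
% a latent variable z (assumed notation).
P(x \mid y, z) = \prod_{i=1}^{n} P(x_i \mid y, z)
```

Under the first factorization a sequence is decoded one residue at a time, whereas under the second all positions can be emitted in parallel.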

