FOLD2SEQ: A JOINT SEQUENCE(1D)-FOLD(3D) EMBEDDING-BASED GENERATIVE MODEL FOR PROTEIN DESIGN

Abstract

Designing novel protein sequences consistent with a desired 3D structure or fold, often referred to as the inverse protein folding problem, is a central but nontrivial task in protein engineering. It has a wide range of applications in energy, biomedicine, and materials science. However, challenges exist due to the complex sequence-fold relationship and difficulties associated with modeling 3D folds. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific fold. Our model learns a fold embedding from the density of secondary structural elements in 3D voxels, and then models the complex sequence-structure relationship by learning a joint sequence-fold embedding. Experiments on high-resolution, complete, and single-structure test sets demonstrate improved performance of Fold2Seq in terms of speed and reliability for sequence design, compared to existing baselines including the state-of-the-art RosettaDesign and other neural net-based approaches. The unique advantages of fold-based Fold2Seq become more evident on diverse real-world test sets comprised of low-resolution, incomplete, or ensemble structures, in comparison to a structure-based model.

1. INTRODUCTION

Computational protein design is the conceptual inverse of the protein structure prediction problem: it aims to infer an amino acid sequence that will fold into a given 3D structure. Designing protein sequences that fold into a desired structure has a broad range of applications, from therapeutics to materials (Kraemer-Pecore et al., 2001). Despite significant advancements in methodologies as well as in computing power, inverse protein design still remains challenging, primarily due to the vast size of the sequence space and the difficulty of learning a function that maps from the 3D structure space to the sequence space. Earlier works rely mostly on energy minimization-based approaches (Koga et al., 2012; Rocklin et al., 2017; Huang et al., 2011), which follow a scoring function (force fields, statistical potentials, or machine learning (ML) models) and sample both the sequence and conformational space. Such methods often suffer from drawbacks such as low accuracy of energy functions or force fields (Khan & Vihinen, 2010) and low efficiency in sequence and conformational search (Koga et al., 2012). Recently, as data on both protein sequences (hundreds of millions) and structures (a few hundred thousand) are quickly accumulating, data-driven approaches for inverse protein design are rapidly emerging (Greener et al., 2018; Karimi et al., 2020; Ingraham et al., 2019). Generally, data-driven protein design attempts to model the probability distribution over sequences conditioned on the structures: P(x|y), where x and y are the protein sequence and structure, respectively. Two key challenges remain: (1) defining a good representation (y) of the protein structure and (2) modeling the sequence generation process conditioned on y. Current protein design methods use protein backbone information from a single protein structure (fixed backbone) or from a set of backbone structures consistent with a single fold (flexible backbone).
In earlier studies, the protein structures are represented as a 1D string (Greener et al., 2018), a 1D vector (Karimi et al., 2020), a 2D image (Strokach et al., 2020), or a graph (Ingraham et al., 2019). The sequence generation methods used in protein design studies can be classified as non-autoregressive (Karimi et al., 2020; Greener et al., 2018; Strokach et al., 2020) and autoregressive (Ingraham et al., 2019; Madani et al., 2020). In non-autoregressive methods, y is usually concatenated with a Gaussian random noise z (which is the latent vector of the sequence) to form the input to a sequence generator: P(x|y) = f_g(y, z). In autoregressive methods, P(x|y) is decomposed through the chain rule: P(x|y) = ∏_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1}, y), where x = (x_1, x_2, ..., x_n). In this paper, we focus on designing sequences conditioned on a protein fold. A protein fold is defined as the arrangement (or topology) of the secondary structure elements of the protein relative to each other (Hou et al., 2003). A secondary structural element can be defined as the three-dimensional form of local segments of a protein sequence. Protein folds are therefore necessarily based on sets of structural elements that distinguish domains. As protein structure is inherently hierarchical, a complete native structure can contain multiple folds and a fold can be present in many protein structures. A single structure (fixed backbone) or an ensemble of structures (flexible backbone) can be used as representatives of a fold. The ensemble representation is often a better choice, as it captures the protein dynamics. Despite the recent progress in using ML models for protein design, significant gaps still remain in addressing both aforementioned challenges (fold representation and conditional sequence generation).
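The autoregressive factorization above can be made concrete with a small sketch. The scorer below is a hypothetical, deterministic stand-in for a learned decoder (it is not Fold2Seq's actual model); it only illustrates how each residue x_i is sampled conditioned on the prefix x_1..x_{i-1} and a fold representation y.

```python
import numpy as np

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def next_token_logits(prefix, fold_id):
    """Hypothetical stand-in for a learned P(x_i | x_<i, y) scorer.
    Deterministic: the seed depends only on the prefix and the fold."""
    seed = (sum(map(ord, prefix)) + fold_id) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.normal(size=len(AMINO_ACIDS))

def sample_sequence(fold_id, length, greedy=True):
    """Autoregressive generation: P(x|y) = prod_i P(x_i | x_<i, y)."""
    seq = []
    for _ in range(length):
        logits = next_token_logits(seq, fold_id)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if greedy:
            idx = int(np.argmax(probs))
        else:
            idx = int(np.random.default_rng().choice(len(probs), p=probs))
        seq.append(AMINO_ACIDS[idx])
    return "".join(seq)

print(sample_sequence(fold_id=42, length=10))
```

In a real model, `next_token_logits` would be a transformer decoder attending to the fold embedding; the sampling loop itself is unchanged.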
First, the current fold representation methods are either hand-designed or constrained, and do not capture the complete original fold space (see Sec. 2.2 for details), resulting in low generalization capacity or efficiency. Second, the sequence encoding and the fold encoding are learned separately in previous methods, which makes the two latent domains heterogeneous. Such heterogeneity across the two domains increases the difficulty of learning the complex sequence-fold relationship. To fill the aforementioned gaps, the main contributions of this work are as follows:
• We propose a novel fold representation, first representing the 3D structure by voxels of the density of secondary structure elements (SSEs), and then learning the fold representation through a transformer-based structure encoder. Compared to previous fold representations, this representation has several advantages: first, it preserves the entire spatial information of SSEs. Second, it does not need any pre-defined rules, so the parameterized fold space is neither limited nor biased toward any particular fold. Third, the representation can be automatically extracted from a given protein structure. Lastly, the density model also loosens the rigidity of structures, so that structural variation and missing structural information are better handled.
• We incorporate a novel joint sequence-fold embedding learning framework into the transformer-based auto-encoder model. By learning a joint latent space between sequences and folds, our model, Fold2Seq, mitigates the heterogeneity between the two domains and is able to better capture the sequence-fold relationship, as reflected in the results.
• Experiments on standard test sets demonstrate that Fold2Seq has superior performance on perplexity, native sequence recovery rate, and native structure recovery accuracy, when compared to competing methods including the state-of-the-art RosettaDesign and other neural net models. An ablation study shows that this superior performance is directly attributable to our algorithmic innovations. Experiments on real-world test sets further demonstrate the unique practical utility and versatility of Fold2Seq compared to structure-based baselines.
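The SSE voxel-density idea in the first contribution can be sketched as follows. This is our own minimal reconstruction, not the paper's exact code: each residue's C-alpha atom deposits a Gaussian density into a multi-channel 3D grid, with one channel per SSE type; the channel set and all parameters (grid size, box size, sigma) are illustrative assumptions.

```python
import numpy as np

# One channel per assumed SSE type: helix, strand, coil/loop, turn.
SSE_CHANNELS = {"H": 0, "E": 1, "C": 2, "T": 3}

def sse_density_grid(ca_coords, sse_labels, grid=20, box=40.0, sigma=2.0):
    """Build a (channels, grid, grid, grid) voxel density tensor.
    ca_coords: (N, 3) C-alpha coordinates in Angstroms, centered in the box.
    sse_labels: per-residue SSE type, keys of SSE_CHANNELS."""
    vox = np.zeros((len(SSE_CHANNELS), grid, grid, grid))
    # Voxel-center coordinates along each axis.
    centers = (np.arange(grid) + 0.5) * (box / grid) - box / 2
    gx, gy, gz = np.meshgrid(centers, centers, centers, indexing="ij")
    for xyz, label in zip(ca_coords, sse_labels):
        # Squared distance from every voxel center to this residue.
        d2 = (gx - xyz[0])**2 + (gy - xyz[1])**2 + (gz - xyz[2])**2
        # Gaussian density accumulates in the channel of this SSE type.
        vox[SSE_CHANNELS[label]] += np.exp(-d2 / (2 * sigma**2))
    return vox

# Example: a short 5-residue helical fragment along the x-axis.
coords = np.array([[i * 1.5 - 3.0, 0.0, 0.0] for i in range(5)])
grid = sse_density_grid(coords, ["H"] * 5)
print(grid.shape)  # (4, 20, 20, 20)
```

Because the density is smooth rather than a hard occupancy, small coordinate perturbations change the grid only gradually, which is consistent with the paper's point that a density model loosens structural rigidity.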

2. RELATED WORK

Data-driven Protein Design The last two years have witnessed a significant surge of protein design studies that deeply exploit data through modern artificial intelligence algorithms.
Greener et al. (2018) used a conditional variational autoencoder for generating protein sequences conditioned on a given fold. Karimi et al. (2020) developed a guided conditional Wasserstein Generative Adversarial Network (gcWGAN), also for inverse fold design. Madani et al. (2020) trained an extremely large (1.2B-parameter) language model conditioned on taxonomic and keyword tags, such as molecular functions, for generating protein sequences. Ingraham et al. (2019) developed a graph-based transformer for generating protein sequences conditioned on a fixed backbone. Lastly, Strokach et al. (2020) formulated inverse protein design as a constraint satisfaction problem (CSP) and applied graph neural networks for generating protein sequences conditioned on the residue-residue distance map, a static representation of the structure.

