PROTEIN SEQUENCE AND STRUCTURE CO-DESIGN WITH EQUIVARIANT TRANSLATION

Abstract

Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity with regard to both sequence and structure, with running times orders of magnitude shorter than those of sampling-based methods.

1. INTRODUCTION

Proteins are macromolecules that mediate the fundamental processes of all living organisms. For decades, researchers have sought to design novel proteins with desired properties (Huang et al., 2016), a problem known as de novo protein design. Nevertheless, the problem is very challenging due to the tremendous search space of both sequence and structure, and the most well-established approaches still rely on hand-crafted energy functions and heuristic sampling algorithms (Leaver-Fay et al., 2013; Alford et al., 2017), which are prone to arriving at suboptimal solutions and are computationally intensive and time-consuming. Recently, machine learning approaches have demonstrated impressive performance on different aspects of protein design, and significant progress has been made (Gao et al., 2020). Most approaches use deep generative models to design protein sequences based on corresponding structures (Ingraham et al., 2019; Jing et al., 2021; Hsu et al., 2022). Despite the great potential of these methods, the structures of the proteins to be engineered are often unknown (Fischman & Ofran, 2018), which hinders their application. Therefore, efforts have been made to develop models that co-design the sequence and structure of proteins (Anishchenko et al., 2021; Wang et al., 2021). As a pioneering work, Jin et al. (2021) propose an autoregressive model that co-designs the sequence and structure of the Complementarity Determining Regions (CDRs) of antibodies based on iterative refinement of protein structures, which has spurred many follow-up works (Luo et al., 2022; Kong et al., 2022). Nevertheless, these approaches are tailored for antibodies, and their effectiveness remains unclear on proteins with arbitrary domain topologies (Anand & Achim, 2022). In addition, they often suffer from high inference costs due to autoregressive sampling or annealed diffusion sampling (Song & Ermon, 2019; Luo et al., 2022).
Very recently, Anand & Achim (2022) propose another diffusion-based generative model (Ho et al., 2020) for general protein sequence-structure co-design, where they adopt three diffusion models to generate structures, sequences, and rotamers of proteins in sequential order. Although applicable to proteins of all topologies, such a sequential generation strategy fails to cross-condition on sequence and structure, which might lead to inconsistent proteins. Besides, the inference process is also expensive due to the use of three separate diffusion processes.
To address the aforementioned issues, in this paper, we propose PROTSEED, a new method for equivariant protein sequence-structure co-design. Specifically, we formulate the co-design task as a translation problem in the joint sequence-structure space based on context features. Here the context features represent prior knowledge encoding constraints that biologists want to impose on the protein to be designed (Dou et al., 2018; Shen et al., 2018). As an illustration, we present three protein design tasks with different given context features in Figure 1. Our PROTSEED consists of a trigonometry-aware encoder that infers geometrical constraints and prior knowledge for protein design from context features, and a novel roto-translation equivariant decoder that iteratively translates proteins into desired states in an end-to-end and equivariant manner.
The equivariance property with respect to protein structures during the whole process is guaranteed by predicting structure updates in local frames based on invariant representations, and then transforming them into global frames using a change-of-basis operation. It is worth mentioning that PROTSEED updates the sequence and structure of all residues in a one-shot manner, leading to a much more efficient inference process. In contrast to the previous method that first generates structures and then generates sequences and rotamers, we allow the model to cross-condition on sequence and structure, and encourage maximal information flow among context features, sequences, and structures, which ensures the fidelity of generated proteins. We conduct extensive experiments on the Structural Antibody Database (SAbDab) (Dunbar et al., 2014) as well as two protein design benchmark data sets curated from CATH (Orengo et al., 1997), and compare PROTSEED against previous state-of-the-art methods on multiple tasks, ranging from antigen-specific antibody CDR design to context-conditioned protein design and fixed backbone protein design. Numerical results show that our method significantly outperforms previous baselines and can generate high-fidelity proteins in terms of both sequence and structure, while running orders of magnitude faster than sampling-based methods. As a proof of concept, we also show through case studies that PROTSEED is able to perform de novo protein design with new folds.
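The local-frame argument for equivariance can be made concrete with a small numerical check. The sketch below is not the PROTSEED decoder. It assumes each residue carries a rigid frame (a rotation matrix and a translation vector), and it replaces the learned network with a hypothetical hand-written rule, `translate_step`, that reads only invariant features (neighbor positions expressed in the residue's own frame) and emits a local shift and a local rotation. All function names and numeric constants here are illustrative assumptions. Every residue is updated from the old frames at once, mirroring the one-shot update described above.

```python
import math

def matvec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def translate_step(frames):
    """One toy update step; all residues are updated from the OLD frames."""
    updated = []
    for i, (R_i, t_i) in enumerate(frames):
        R_i_T = transpose(R_i)
        # Invariant features: other residues' positions expressed in frame i.
        local = [matvec(R_i_T, [t_j[k] - t_i[k] for k in range(3)])
                 for j, (_, t_j) in enumerate(frames) if j != i]
        # Hypothetical stand-in for a learned prediction, made in the local
        # frame: a small shift toward the neighbor centroid, plus a small
        # rotation whose angle depends only on the length of that shift.
        u = [0.1 * sum(p[k] for p in local) / len(local) for k in range(3)]
        Q = rot_z(0.05 * math.sqrt(sum(x * x for x in u)))
        # Change of basis: map the local update into the global frame.
        shift = matvec(R_i, u)
        updated.append((matmul(R_i, Q), [t_i[k] + shift[k] for k in range(3)]))
    return updated

def apply_rigid(frames, R_g, t_g):
    """Apply a global rotation R_g and translation t_g to every frame."""
    out = []
    for R, t in frames:
        Rt = matvec(R_g, t)
        out.append((matmul(R_g, R), [Rt[k] + t_g[k] for k in range(3)]))
    return out

def max_abs_diff(frames_a, frames_b):
    """Largest entrywise difference between two lists of frames."""
    diff = 0.0
    for (Ra, ta), (Rb, tb) in zip(frames_a, frames_b):
        for k in range(3):
            diff = max(diff, abs(ta[k] - tb[k]))
            for l in range(3):
                diff = max(diff, abs(Ra[k][l] - Rb[k][l]))
    return diff

# Three arbitrary residue frames and an arbitrary global rigid motion.
frames = [
    (rot_z(0.3), [0.0, 0.0, 0.0]),
    (matmul(rot_x(1.1), rot_z(-0.7)), [3.8, 0.0, 0.0]),
    (rot_x(2.0), [5.0, 3.0, 1.0]),
]
R_g = matmul(rot_z(0.9), rot_x(-1.3))
t_g = [10.0, -4.0, 2.5]

# Equivariance: transforming then updating equals updating then transforming.
err = max_abs_diff(translate_step(apply_rigid(frames, R_g, t_g)),
                   apply_rigid(translate_step(frames), R_g, t_g))
print("equivariant up to rounding:", err < 1e-9)
```

Because the features fed to the update rule are unchanged by any global rotation and translation, the predicted local update is identical in both cases, and the change of basis carries the global motion through. The same reasoning applies when the hand-written rule is replaced by any function of invariant inputs.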

2. RELATED WORK

Protein Design. The most well-established approaches to protein design mainly rely on handcrafted energy functions to iteratively search for low-energy protein sequences and conformations with heuristic sampling algorithms (Leaver-Fay et al., 2013; Alford et al., 2017; Tischer et al., 2020). Nevertheless, these conventional methods are computationally intensive, and are prone to arriving at local optima due to the complicated energy landscape. Recent advances in deep generative models open the door to data-driven approaches, and a variety of models have been proposed to generate protein sequences (Rives et al., 2021; Shin et al., 2021; Ferruz et al., 2022) or backbone structures (Anand & Huang, 2018; Eguchi et al., 2022; Trippe et al., 2022). To gain fine-grained control over designed proteins, methods have been developed to predict sequences that can fold into given backbone structures (Ingraham et al., 2019; Jing et al., 2021; Anand et al., 2022; Dauparas et al., 2022), a task known as fixed backbone design. These methods achieve promising results but require the desired protein structure to be known a priori.



Figure 1: Illustration of three protein design tasks with different context features. (a) Antigen-specific CDR co-design given structure and sequence of antibody framework and binding antigen. (b) Protein sequence-structure co-design conditioned on secondary structure (SS) annotation and binary contact features. (c) Fixed backbone sequence design conditioned on given backbone structures.

