PROTEIN SEQUENCE AND STRUCTURE CO-DESIGN WITH EQUIVARIANT TRANSLATION

Abstract

Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach to protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity in both sequence and structure, with running time orders of magnitude lower than that of sampling-based methods.

1. INTRODUCTION

Proteins are macromolecules that mediate the fundamental processes of all living organisms. For decades, researchers have sought to design novel proteins with desired properties (Huang et al., 2016), a problem known as de novo protein design. Nevertheless, the problem is very challenging due to the tremendous search space of both sequence and structure, and the most well-established approaches still rely on hand-crafted energy functions and heuristic sampling algorithms (Leaver-Fay et al., 2013; Alford et al., 2017), which are computationally intensive and prone to arriving at suboptimal solutions. Recently, machine learning approaches have demonstrated impressive performance on different aspects of protein design, and significant progress has been made (Gao et al., 2020). Most approaches use deep generative models to design protein sequences based on corresponding structures (Ingraham et al., 2019; Jing et al., 2021; Hsu et al., 2022). Despite the great potential of these methods for protein design, the structures of the proteins to be engineered are often unknown (Fischman & Ofran, 2018), which hinders their application. Therefore, efforts have been made to develop models that co-design the sequence and structure of proteins (Anishchenko et al., 2021; Wang et al., 2021). As a pioneering work, Jin et al. (2021) propose an autoregressive model that co-designs the sequence and structure of the Complementarity Determining Regions (CDRs) of antibodies based on iterative refinement of protein structures, which has spurred many follow-up works (Luo et al., 2022; Kong et al., 2022). Nevertheless, these approaches are tailored for antibodies, and their effectiveness on proteins with arbitrary domain topologies remains unclear (Anand & Achim, 2022). In addition, they often suffer from high inference costs due to autoregressive sampling or annealed diffusion sampling (Song & Ermon, 2019; Luo et al., 2022).
Very recently, Anand & Achim (2022) propose another diffusion-based generative model (Ho et al., 2020) for general protein sequence-structure co-design, where they

