CONDITIONAL GENERATIVE MODELING FOR DE NOVO HIERARCHICAL MULTI-LABEL FUNCTIONAL PROTEIN DESIGN

Anonymous

Abstract

The availability of vast protein sequence information and rich functional annotations thereof holds great potential for protein design applications in biomedicine and synthetic biology. To date, there exists no method for the general-purpose design of proteins that works without prior knowledge about the protein of interest, such as costly and rare structure information or seed sequence fragments. However, the Gene Ontology (GO) database provides information about the hierarchical organisation of protein functions and could thus inform generative models about the underlying complex sequence-function relationships, replacing the need for structural data. We therefore propose to use conditional generative adversarial networks (cGANs) for the task of fast de novo hierarchical multi-label protein design. We generate protein sequences exhibiting properties of a large set of molecular functions extracted from the GO database, using a single model and without any prior information. Through a thorough hyperparameter selection process and analysis, we shed light on efficient conditioning mechanisms and adapted network architectures. We further provide statistically and biologically driven evaluation measures for generative models in the context of protein design to assess the quality of the generated sequences and facilitate progress in the field. We show that our proposed model, ProteoGAN, outperforms several baselines when designing proteins given a functional label and generates well-formed sequences.

1. INTRODUCTION

Designing proteins with a target biological function is an important task in biotechnology with high-impact implications in pharmaceutical research, such as drug design, and in synthetic biology (Huang et al., 2016). However, the task is challenging since the sequence-structure-function relationship of proteins is extremely complex and not yet fully understood (Dill & MacCallum, 2012). Functional protein design is currently done by traditional methods such as directed evolution (Arnold, 1998), which rely on a few random mutations of known proteins and selective pressure to explore a space of related proteins. However, this process can be time-consuming and cost-intensive, and it usually explores only a small part of the sequence space. In parallel, data characterizing proteins and their functions is readily available and constitutes a promising opportunity for machine learning applications in protein sequence design. Moreover, the hierarchical organisation of protein functions in a complex ontology of labels could help machine learning models capture sequence-information relationships adequately. Recently, generative models have been applied to protein design for different tasks, such as developing new therapies (Muller et al., 2018; Davidsen et al., 2019) or enzymes (Repecka et al., 2019). Nonetheless, most de novo protein sequence design methods, which generate sequences from scratch, focus on a specific function or on families of short proteins. Instead, we focus on modeling several different biological functions at the same time, so that they can eventually be combined freely. To this end, one first requires a model that is able to handle and exploit the inherent label structure. We concern ourselves with the development of such a generative model.
In this work, we introduce the general-purpose generative model ProteoGAN, a conditional generative adversarial network (cGAN) that is able to generate protein sequences given a large set of functions in the Gene Ontology (GO) Molecular Function directed acyclic graph (DAG) (Gene On-

