CONDITIONAL GENERATIVE MODELING FOR DE NOVO HIERARCHICAL MULTI-LABEL FUNCTIONAL PROTEIN DESIGN

Anonymous

Abstract

The availability of vast protein sequence information and rich functional annotations thereof has great potential for protein design applications in biomedicine and synthetic biology. To date, there exists no method for the general-purpose design of proteins without any prior knowledge about the protein of interest, such as costly and rare structure information or seed sequence fragments. However, the Gene Ontology (GO) database provides information about the hierarchical organisation of protein functions, and thus could inform generative models about the underlying complex sequence-function relationships, replacing the need for structural data. We therefore propose to use conditional generative adversarial networks (cGANs) for the task of fast de novo hierarchical multi-label protein design. We generate protein sequences exhibiting properties of a large set of molecular functions extracted from the GO database, using a single model and without any prior information. We shed light on efficient conditioning mechanisms and adapted network architectures through a thorough hyperparameter selection process and analysis. We further provide statistically and biologically driven evaluation measures for generative models in the context of protein design to assess the quality of the generated sequences and facilitate progress in the field. We show that our proposed model, ProteoGAN, outperforms several baselines when designing proteins given a functional label and generates well-formed sequences.

1. INTRODUCTION

Designing proteins with a target biological function is an important task in biotechnology with high-impact implications for pharmaceutical research, for example in drug design, and for synthetic biology (Huang et al., 2016). However, the task is challenging since the sequence-structure-function relationship of proteins is extremely complex and not yet fully understood (Dill & MacCallum, 2012). Functional protein design is currently done by traditional methods such as directed evolution (Arnold, 1998), which rely on a few random mutations of known proteins and selective pressure to explore a space of related proteins. This process can be time-consuming and cost-intensive, and most often explores only a small part of the sequence space. In parallel, data characterizing proteins and their functions is readily available and constitutes a promising opportunity for machine learning applications in protein sequence design. Moreover, the hierarchical organisation of protein functions in a complex ontology of labels could help machine learning models capture sequence-information relationships adequately. Recently, generative models have been applied to protein design for different tasks, such as developing new therapies (Muller et al., 2018; Davidsen et al., 2019) or enzymes (Repecka et al., 2019). Nonetheless, most de novo protein sequence design methods, which generate sequences from scratch, focus on a specific function or on families of short proteins. Instead, we would like to model several different biological functions at the same time, so that they can eventually be freely combined. This first requires a model that is able to handle and exploit the inherent label structure. We concern ourselves with the development of such a generative model.
In this work, we introduce the general-purpose generative model ProteoGAN, a conditional generative adversarial network (cGAN) that is able to generate protein sequences given a large set of functions in the Gene Ontology (GO) Molecular Function directed acyclic graph (DAG) (Gene Ontology Consortium, 2019). To the best of our knowledge, we are the first to propose a hierarchical multi-label de novo protein design framework that does not require prior knowledge about the protein, such as seed sequence fragments or structure. Our contributions can be summarized as follows: (i) we propose a data-driven approach to de novo functional protein generation that leverages a large set of annotated sequences, (ii) we present a new extensive evaluation scheme to assess validity, conditional consistency, diversity, and biological relevance of the generated sequences, and (iii) we conduct an in-depth model optimization to derive actionable insights on architectural choices and efficient conditioning mechanisms while outperforming existing state-of-the-art protein generators. We focus on generative adversarial networks because of their promising performance on specific sequence design tasks (Repecka et al., 2019). We choose a conditional setting to avoid relying on oracles or on multiple rounds of training-generation-measurement, since to date a well-performing general-purpose predictor of protein function remains elusive (Zhou et al., 2019). As opposed to most existing methods (see Section 2), we aim to generate a comprehensive variety of proteins exhibiting a wide range of functions, rather than focusing on optimising a single function within a unique protein family. As this task differs from those found in the literature, we need to define an adequate evaluation pipeline. Therefore, we establish a multiclass protein generation evaluation scheme centered around validity and conditional consistency.
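To make the notion of a hierarchical multi-label condition concrete, the following minimal Python sketch encodes a protein's annotations as a multi-hot vector over a label DAG. It assumes that an annotation with a term implies annotation with all of its ancestors, which is how GO annotations are commonly propagated; this section does not state whether ProteoGAN encodes its labels this way. The term names, the `parents` mapping, and the helper functions are hypothetical toy examples, not real GO identifiers or part of the proposed model.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def encode_labels(annotated, parents, vocabulary):
    """Multi-hot encode `annotated` terms plus all their ancestors over `vocabulary`."""
    closed = set(annotated)
    for term in annotated:
        closed |= ancestors(term, parents)
    return [1 if term in closed else 0 for term in vocabulary]


# Hypothetical toy ontology: "kinase" has two parents, so this is a DAG, not a tree.
parents = {
    "binding": ["root"],
    "catalytic": ["root"],
    "ion_binding": ["binding"],
    "kinase": ["catalytic", "binding"],
}
vocabulary = ["root", "binding", "catalytic", "ion_binding", "kinase"]

kinase_vector = encode_labels(["kinase"], parents, vocabulary)
ion_vector = encode_labels(["ion_binding"], parents, vocabulary)
```

A vector of this kind could serve as the conditioning input of a cGAN; because a single leaf annotation activates several related labels, the model sees the label structure rather than a set of unrelated classes.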
The model should generate protein sequences whose distribution resembles that of natural proteins and which hence have similar physicochemical properties, and it should do so conditionally, namely by generating proteins of a given functional class without off-target functions. We are hence confronted with the problem of assessing (i) the performance of the generative model in a general sense, which is defined by how well the generated distribution fits the training data distribution, and (ii) the conditional performance of the model, which we define as a special case of the general performance, where we compare sequence feature distributions between labels. We therefore require distribution-based evaluations. A natural choice to evaluate the performance of a generative model is a two-sample test, which allows us to assess whether a generated and a real set of samples (i.e. the dataset) could originate from the same distribution. The difficulty here is to define a measure that can handle structured data, in our case protein sequences. To this end, we design Maximum Mean Discrepancy (MMD)-based evaluation criteria (Gretton et al., 2012), which assess general model performance and the functioning of the conditioning mechanism by measuring differences in empirical distribution between sets of generated and real protein sequences. To ensure diversity, we monitor the duality gap (Grnarova et al., 2019), a domain-agnostic indicator for GAN training. Lastly, we use a series of biologically driven criteria in the evaluation phase that confirm the biological validity of the generated proteins by relying on the standard protein feature software ProFET (Ofer & Linial, 2015). With this arsenal of measures, and given the low computational complexity of our MMD-based criteria, we compare different architectural choices and hyperparameters in an extensive and efficient Bayesian Optimization and HyperBand (BOHB) (Falkner et al., 2018) search.
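As an illustration of an MMD-based comparison between real and generated sequences, the sketch below computes the biased (V-statistic) estimate of the squared MMD, i.e. the mean kernel value within each set minus twice the mean kernel value across sets. The kernel is an assumption made for this example: a normalized 3-mer spectrum kernel is used because it handles variable-length strings, and this section does not specify which kernel or sequence embedding the actual criteria rely on. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count the overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Normalized spectrum kernel: cosine similarity of k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(count * cb[kmer] for kmer, count in ca.items() if kmer in cb)
    norm = math.sqrt(
        sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    )
    return dot / norm if norm > 0 else 0.0


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=spectrum_kernel):
    """Biased estimate of the squared MMD between two sets of sequences."""
    return (
        mean_kernel(xs, xs, kernel)
        + mean_kernel(ys, ys, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
    )


# Toy data (assumed for illustration): one generated set resembles the real
# set, the other shares no 3-mers with it.
real = ["MKTAYIAKQR", "MKTAYIAKQL", "MKSAYIAKQR"]
generated_close = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQL"]
generated_far = ["GGGGGGGGGG", "PPPPPPPPPP", "GGPPGGPPGG"]

mmd_close = mmd_squared(real, generated_close)
mmd_far = mmd_squared(real, generated_far)
```

A smaller value indicates that the two sets are harder to tell apart under the chosen kernel. A conditional variant could, under the same assumptions, compare generated and real sequences label by label.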
In particular, we develop improved variants of two existing conditioning mechanisms for GANs (Odena et al., 2017; Miyato & Koyama, 2018) and show for the first time that the previously unexplored combination of both is beneficial to conditional generation. Moreover, the selected model outperforms (i) the de novo conditional model CVAE (Greener et al., 2018), repurposed and trained for functional protein generation, as well as the other baselines we introduce (an HMM and an n-gram model), and (ii) models specifically built to challenge the necessity of a conditional mechanism. The remainder of the document is organized as follows. First, the background and related work section gives a concise overview of the biological mechanisms underlying the function of proteins, summarises state-of-the-art generative models applied to protein design, details some conditional mechanisms in GANs, and identifies existing evaluation criteria for GANs and cGANs. Subsequently, the method section describes ProteoGAN and its components and explains our protein generation evaluation framework. Finally, the results obtained by conditioning the generation of new sequences on 50 GO classes are presented and discussed before concluding with some final remarks.

