NEURAL REPRESENTATION AND GENERATION FOR RNA SECONDARY STRUCTURES

Abstract

Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence its cellular activities and functions. The design of large-scale, complex biological structures, a key but underappreciated aspect of computational drug discovery, calls for dedicated graph-based deep generative modeling techniques. In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequences in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive bias about RNA structural regularity and folding mechanisms such that high structural validity, stability and diversity of generated RNAs are achieved. We also seek to organize the latent space of RNA molecular embeddings with regard to interactions with proteins, and use targeted optimization to navigate this latent space in search of desired novel RNA molecules.

1. INTRODUCTION

There is increasing interest in developing deep generative models for biochemical data, especially in the context of generating drug-like molecules. Learning generative models of biochemical molecules can facilitate the development and discovery of novel treatments for various diseases, reducing the lead time for discovering promising new therapies and potentially translating into reduced costs for drug development (Stokes et al., 2020). Indeed, the study of generative models for molecules has become a rich and active subfield within machine learning, with standard benchmarks (Sterling & Irwin, 2015), a set of well-known baseline approaches (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Liu et al., 2018; Jin et al., 2018), and high-profile cases of real-world impact¹.
Prior work in this space has focused primarily on the generation of small molecules (with fewer than 100 atoms), leaving the development of generative models for larger and more complicated biologics and biosimilar drugs (e.g., RNA and protein peptides) an open area for research. Developing generative models for larger biochemicals is critical in order to expand the frontiers of automated treatment design. More generally, developing effective representation learning for such complex biochemicals will allow machine learning systems to integrate knowledge and interactions involving these biologically rich structures.
In this work, we take a first step towards the development of deep generative models for complex biomolecules, focusing on the representation and generation of RNA structures. RNA plays a crucial role in protein transcription and various regulatory processes within cells which can be influenced by its structure (Crick, 1970; Stefl et al., 2005), and RNA-based therapies are an increasingly active area of research (Pardi et al., 2018; Schlake et al., 2012), making it a natural focus for the development of deep generative models.
Published as a conference paper at ICLR 2021
The key challenge in generating RNA molecules, compared to the generation of small molecules, is that RNA involves a hierarchical, multi-scale structure, including a primary sequential structure based on the sequence of nucleotides as well as more complex secondary and tertiary structures based on the way that the RNA strand folds onto itself. An effective generative model for RNA must be able to generate sequences that give rise to these more complex emergent structures. There have been prior works on optimizing or designing RNA sequences (using reinforcement learning or blackbox optimization) to generate particular RNA secondary structures (Runge et al., 2019; Churkin et al., 2017). However, these prior works generally focus on optimizing sequences to conform to a specific secondary structure. In contrast, our goal is to define a generative model, which can facilitate the sampling and generation of diverse RNA molecules with meaningful secondary structures, while also providing a novel avenue for targeted RNA design via search over a tractable latent space.
Key contributions. We propose a series of benchmark tasks and deep generative models for the task of RNA generation, with the goal of facilitating future work on this important and challenging problem. We propose three interrelated benchmark tasks for RNA representation and generation:
1. Unsupervised generation: Generating stable, valid, and diverse RNAs that exhibit complex secondary structures.
2. Semi-supervised learning: Learning latent representations of RNA structure that correlate with known RNA functional properties.
3. Targeted generation: Generating RNAs that exhibit particular functional properties.
These three tasks build upon each other, with the first task only requiring the generation of stable and valid molecules, while the latter two tasks involve representing and generating RNAs that exhibit particular properties.
In addition to proposing these novel benchmarks for the field, we introduce and evaluate three generative models for RNA. All three models build upon variational autoencoders (VAEs) (Kingma & Welling, 2014) augmented with normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016), and they differ in how they represent the RNA structure. To help readers better understand RNA structures and properties, a self-contained explanation is provided in Appendix B. The simplest model (termed LSTMVAE) learns from a string-based representation of RNA structure. The second model (termed GraphVAE) leverages a graph-based representation and a graph neural network (GNN) encoder (Gilmer et al., 2017). Finally, the most sophisticated model (termed HierVAE) introduces and leverages a novel hierarchical decomposition of the RNA structure. Extensive experiments on our newly proposed benchmarks highlight how the hierarchical approach allows more effective representation and generation of complex RNA structures, while also highlighting important challenges for future work in the area.
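
To make the difference between the string-based and graph-based views of an RNA concrete, the following is a minimal sketch, not the authors' implementation. It assumes the string representation is the common dot-bracket notation, where matching parentheses mark paired nucleotides and dots mark unpaired ones; this section does not name the notation. The function names parse_dot_bracket and to_graph_edges, the edge labels, and the toy sequence are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return the base pairs (i, j), with i < j, encoded by a dot-bracket string.

    Assumption: only '(', ')' and '.' appear (no pseudoknot symbols).
    """
    open_positions: List[int] = []  # stack of unmatched '(' positions
    pairs: List[Tuple[int, int]] = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


def to_graph_edges(sequence: str, structure: str) -> List[Tuple[int, int, str]]:
    """Build a labeled edge list over nucleotide indices.

    Two hypothetical edge types are used: 'backbone' edges between
    consecutive nucleotides and 'pair' edges between paired nucleotides.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    backbone = [(i, i + 1, "backbone") for i in range(len(sequence) - 1)]
    pairing = [(i, j, "pair") for i, j in parse_dot_bracket(structure)]
    return backbone + pairing


if __name__ == "__main__":
    # Toy hairpin: a three-pair stem closing a three-nucleotide loop.
    for edge in to_graph_edges("GGGAAACCC", "(((...)))"):
        print(edge)
```

A string-based model would consume the sequence and structure strings directly, whereas a GNN encoder would operate on an edge list of this kind, with nucleotide identities as node features.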

2. TASK DESCRIPTION

Given a dataset of RNA molecules, i.e., sequences of nucleotides and corresponding secondary structures, our goals are to: (a) learn to generate structurally stable, diverse, and valid RNA molecules that reflect the distribution in this training dataset; (b) learn latent representations that reflect the functional properties of RNA. A key factor in both the representation and generation processes is that we seek to jointly represent and generate the primary sequence structure as well as the secondary structure conformation. Together, these two goals lay the foundations for generating novel RNAs that satisfy certain functional properties. To meet these goals, we create two types of benchmark datasets, each focusing on one aspect of the above-mentioned goals:
Unlabeled and variable-length RNA. The first dataset contains unlabeled RNA of moderate and highly variable length (32-512 nts), obtained from the human transcriptome (Aken et al., 2016), through which we focus on the generation aspect of structured RNA and evaluate the validity, stability and diversity of generated RNA molecules. In particular, our goal with this dataset is to jointly generate RNA sequences and secondary structures that are biochemically feasible (i.e., valid), have



¹ E.g., the LambdaZero project for exascale search of drug-like molecules.

