NEURAL REPRESENTATION AND GENERATION FOR RNA SECONDARY STRUCTURES

Abstract

Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence their cellular activities and functions. The design of large scale and complex biological structures spurs dedicated graph-based deep generative modeling techniques, which represents a key but underappreciated aspect of computational drug discovery. In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequence in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive bias about RNA structural regularity and folding mechanism such that high structural validity, stability and diversity of generated RNAs are achieved. Also, we seek to adequately organize the latent space of RNA molecular embeddings with regard to the interaction with proteins, and targeted optimization is used to navigate in this latent space to search for desired novel RNA molecules.

1. INTRODUCTION

There is an increasing interest in developing deep generative models for biochemical data, especially in the context of generating drug-like molecules. Learning generative models of biochemical molecules can facilitate the development and discovery of novel treatments for various diseases, reducing the lead time for discovering promising new therapies and potentially translating in reduced costs for drug development (Stokes et al., 2020) . Indeed, the study of generative models for molecules has become a rich and active subfield within machine learning, with standard benchmarks (Sterling & Irwin, 2015) , a set of well-known baseline approaches (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Liu et al., 2018; Jin et al., 2018) , and high-profile cases of real-world impactfoot_0 . Prior work in this space has focused primarily on the generation of small molecules (with less than 100 atoms), leaving the development of generative models for larger and more complicated biologics and biosimilar drugs (e.g., RNA and protein peptides) an open area for research. Developing generative models for larger biochemicals is critical in order to expand the frontiers of automated treatment design. More generally, developing effective representation learning for such complex biochemicals will allow machine learning systems to integrate knowledge and interactions involving these biologically-rich structures. In this work, we take a first step towards the development of deep generative models for complex biomolecules, focusing on the representation and generation of RNA structures. RNA plays a crucial



e.g. LambdaZero project for exascale search of drug-like molecules.1

