PROGAE: A GEOMETRIC AUTOENCODER-BASED GENERATIVE MODEL FOR DISENTANGLING PROTEIN CONFORMATIONAL SPACE

Abstract

Understanding the protein conformational landscape is critical, as protein function, as well as its modulation by ligand binding or environmental changes, is intimately connected with structural variation. This work focuses on learning a generative neural network on a simulated ensemble of protein structures obtained using molecular simulation, in order to characterize the distinct structural fluctuations of a protein bound to various drug molecules. Specifically, we use a geometric autoencoder framework to learn separate latent space encodings of the intrinsic and extrinsic geometries of the system. For this purpose, the proposed Protein Geometric AutoEncoder (ProGAE) model is trained on the lengths of the alpha-carbon pseudo-bonds and the orientations of the backbone bonds of the protein. Using ProGAE latent embeddings, we reconstruct and generate the conformational ensemble of a protein at or near the experimental resolution. Owing to the disentangled latent space, the intrinsic latent embedding helps in geometric error correction, whereas the extrinsic latent embedding is successfully used for classification and property prediction of the different drugs bound to a specific protein. Additionally, ProGAE can be transferred to the structures of a different state of the same protein, or to a completely different protein of a different size, where only the dense layer decoding from the latent representation needs to be retrained. Results show that our geometric learning-based method is both accurate and efficient at generating complex structural variations, charting a path toward scalable and improved approaches for analyzing and enhancing molecular simulations.

1. INTRODUCTION

The complex and time-consuming calculations in molecular simulations have been significantly impacted by the application of machine learning techniques in recent years. In particular, deep learning has been applied to the analysis and simulation of molecular trajectories to address diverse problems, such as estimating free energy surfaces, defining optimal reaction coordinates, constructing Markov State Models, and enhancing molecular sampling. For a comprehensive review of deep learning methods for analyzing and enhancing molecular simulations, see (Noé et al., 2020a) and (Noé et al., 2020b). Specifically, there has been interest in modeling the underlying conformational space of proteins with deep generative models, e.g. (Bhowmik et al., 2018; Guo et al., 2020; Ramaswamy et al., 2020; Varolgüneş et al., 2020). This line of work has mainly attempted to respect the domain geometry by using convolutional AEs on features extracted from 3D structures. In parallel, learning directly from 3D structure has recently developed into an exciting and promising application area for deep learning. In this work, we learn the protein conformational space from a set of protein simulations using geometric deep learning. We also investigate how the geometry of a protein itself can assist learning and improve the interpretability of the latent conformational space. Namely, we consider the influence of intrinsic and extrinsic geometry, where intrinsic geometry is independent of the 3D embedding and extrinsic geometry is not; intrinsic geometric properties of a protein can thus be thought of as robust to conformational change. To this end, we propose a Protein Geometric Autoencoder model, named ProGAE, to separately encode intrinsic and extrinsic protein geometries.
• Inspired by recent unsupervised geometric disentanglement learning works (Tatro et al., 2020; Wu et al., 2019; Yang et al., 2020), we propose a novel geometric autoencoder, named ProGAE, that learns directly from 3D protein structures by separately encoding intrinsic and extrinsic geometries into disjoint latent spaces used to generate protein structures.
• We further propose a novel formulation in which the network's intrinsic input is the Cα-Cα pseudo-bond lengths and its extrinsic input is the backbone bond orientations.
• Analysis shows that the learned extrinsic geometric latent space can be used for drug classification and drug property prediction, where the drug is bound to the given protein.
• We find that the intrinsic geometric latent space, despite the small variation in the intrinsic input signal, is important for reducing geometric errors in reconstructed proteins.
• We also demonstrate that a trained ProGAE can be transferred to a trajectory of the protein in a different state, or to a trajectory of a different protein altogether.

(2020) leverages the notion of intrinsic and extrinsic geometry to define an architecture for a fold classification task. Additionally, there has been focus on directly learning the temporal aspects of molecular dynamics from simulation trajectories, which is not directly related to the current work; please see Appendix A.1 for a detailed discussion. Most closely related to this work is a body of recent AE-based approaches for analyzing and/or generating structures from the latent space (Bhowmik et al., 2018; Guo et al., 2020; Ramaswamy et al., 2020; Varolgüneş et al., 2020). (Bhowmik et al., 2018) and (Guo et al., 2020) aim at learning from and generating protein contact maps, while ProGAE deals directly with 3D structures; a direct comparison of ProGAE with these methods is therefore not possible. Ramaswamy et al. (2019) use a 1D CNN autoencoder trained on backbone coordinates with a loss objective comprising a geometric MSE error and physics-based (bond length, bond angle, etc.) terms. Due to the unavailability of code or a pre-trained model, we were unable to perform a direct comparison. Varolgüneş et al. (2020) use a VAE with a Gaussian mixture prior to cluster high-dimensional input configurations in the learned latent space. While the method works well on toy models and a standard alanine dipeptide benchmark, its performance drops as the protein system grows to 15 amino acids, which is approximately an order of magnitude smaller than the protein systems studied here. Moreover, their approach is unlikely to scale well to larger systems due to the use of fully-connected layers in the encoder. None of these works has considered explicit disentangling of intrinsic and extrinsic geometries. To our knowledge, this work is the first to propose an autoencoder for unsupervised modeling of the geometric disentanglement of the protein conformational space captured in molecular simulations. This representation provides better interpretability of the latent space in terms of physico-chemical and geometric attributes, results in more geometrically accurate protein conformations, and scales and transfers well to larger protein systems.
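To make the two input signals concrete, the intrinsic input (Cα-Cα pseudo-bond lengths) and the extrinsic input (unit orientation vectors of consecutive backbone bonds) can be computed from atomic coordinates roughly as follows. This is an illustrative sketch, not the paper's code; the function names and the NumPy implementation are our own.

```python
import numpy as np

def intrinsic_features(ca_coords):
    """Lengths of consecutive C-alpha pseudo-bonds (intrinsic signal).

    ca_coords: (N, 3) array of C-alpha coordinates along the chain.
    Returns an (N-1,) array of pseudo-bond lengths, which are nearly
    constant across conformations (hence "intrinsic").
    """
    diffs = ca_coords[1:] - ca_coords[:-1]          # (N-1, 3) bond vectors
    return np.linalg.norm(diffs, axis=1)            # (N-1,) lengths

def extrinsic_features(backbone_coords):
    """Unit orientation vectors of consecutive backbone bonds (extrinsic signal).

    backbone_coords: (M, 3) array of backbone atom coordinates in chain order.
    Returns an (M-1, 3) array of unit vectors, which depend on the 3D
    embedding of the chain (hence "extrinsic").
    """
    diffs = backbone_coords[1:] - backbone_coords[:-1]
    lengths = np.linalg.norm(diffs, axis=1, keepdims=True)
    return diffs / lengths                          # (M-1, 3) unit orientations
```

Separating the inputs this way mirrors the disentanglement goal: bond lengths are essentially conformation-invariant, while bond orientations carry the conformational variation.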

2. PROGAE FOR PROTEIN CONFORMATIONAL SPACE

First, we introduce the input signal for our novel geometric autoencoder, ProGAE. We then discuss how ProGAE utilizes this signal to generate the conformational space of a protein.

Geometric Features of Protein as Network Input

ProGAE functions by separately encoding intrinsic and extrinsic geometry with the goal of achieving better latent space interpretability. We

