CONDITIONAL INVARIANCES FOR CONFORMER INVARIANT PROTEIN REPRESENTATIONS

Abstract

Representation learning for proteins is an emerging area in geometric deep learning. Recent works have factored in both the relational (atomic bonds) and the geometric aspects (atomic positions) of the task, notably bringing together graph neural networks (GNNs) with neural networks for point clouds. The equivariances and invariances to geometric transformations (group actions such as rotations and translations) considered so far treat large molecules as rigid structures. However, in many important settings, proteins co-exist as an ensemble of multiple stable conformations. These conformations cannot be described as input-independent transformations of the protein: two proteins may require different sets of transformations to describe their viable conformations. To address this limitation, we introduce the concept of conditional transformations (CT). CTs capture protein structure while respecting the constraints on dihedral (torsion) angles and steric repulsions between atoms. We then introduce a Markov chain Monte Carlo framework to learn representations that are invariant to these conditional transformations. Our results show that endowing existing baseline models with these conditional transformations improves their performance without sacrificing computational efficiency.

1. INTRODUCTION

The literature on geometric deep learning has achieved much success with neural networks that explicitly model equivariances (or invariances) to group transformations (Cohen & Welling, 2016; Maron et al., 2018; Kondor & Trivedi, 2018; Finzi et al., 2020). Among applications to the physical sciences, group-equivariant graph neural networks and transformers have found use for small molecules as well as large molecules (e.g. proteins) with tremendous success (Klicpera et al., 2020; Anderson et al., 2019; Fuchs et al., 2020; Hutchinson et al., 2021; Satorras et al., 2021; Batzner et al., 2021). In particular, machine learning for proteins (and 3D macromolecular structures in general) is a rapidly growing application area in geometric deep learning (Bronstein et al., 2021; Gerken et al., 2021). Traditionally, proteins have been modeled using standard 3D CNNs (Karimi et al., 2019; Pagès et al., 2019), graph neural networks (GNNs) (Kipf & Welling, 2016; Hamilton et al., 2017), and transformers (Vaswani et al., 2017). More recently, several works (Jing et al., 2020; 2021; Jumper et al., 2021; Hermosilla et al., 2021) have enriched these models with neural networks that are equivariant (invariant) to transformations from the Euclidean and rotation groups. While equivariance (invariance) to these groups is a necessary property for the model, it is limited to capturing rigid transformations of the input object. Such models may not account for all invariances of the input pertinent to the downstream task, and the transformations they are invariant to do not depend on the input. For instance, invariance to the Euclidean group forces the protein representation to act as if the protein were a rigid structure, regardless of the protein under consideration. However, treating proteins as rigid structures may not be optimal for many downstream tasks.
Moreover, different proteins may have different sets of conformations (protein 3D structures with flexible side chains) (Harder et al., 2010; Gainza et al., 2012; Miao & Cao, 2016). The rigid-body assumption in existing protein representations may therefore hurt these methods on datasets and downstream tasks that require representations to be invariant to a specific set of protein conformations. For example, for most proteins, regardless of which (viable) side-chain conformation is under consideration, the fold class and other scalar properties remain the same, mutation (in)stability remains unaltered, and protein-ligand binding affinity (apart from changes at the ligand binding site) remains the same. In light of this limitation of current methods, a question naturally arises: Is it possible to learn conformer-invariant protein representations?

Our Approach: We propose a representation method where the set of symmetries in the task is input-dependent, which we denote conditional invariances. For our specific application, we model every protein as a directed forest, in which every amino acid forms a directed tree. We leverage this directed forest to sample viable protein conformations (where viability is checked with protein structure validation tools such as MolProbity (Davis et al., 2007; Chen et al., 2010)), coupled with a Markov chain Monte Carlo (MCMC) framework that creates data augmentations for training the neural network to learn conformer-invariant representations. Our contributions can be summarized as follows:

• We provide guiding principles for defining conditionally invariant representations as inductive biases, which serve as a generalization of group-invariant neural networks.
• We provide a principled strategy to sample conformations for any given protein from the support of its conformer distribution (which consists of all its viable conformations), capturing its true flexibility. Viable conformations respect domain-specific constraints such as those on dihedral angles and steric repulsions, among others. This is particularly useful when the set of viable transformations differs across proteins.

• We develop an MCMC-based learning framework which guides the sampling of protein conformations such that the (asymptotic empirical) average representations obtained over all viable conformations of a protein are identical.

• Finally, we experimentally evaluate our proposal: endowing baseline models with the proposed strategy (via an implicit data augmentation scheme) yields noticeable improvements on multiple protein classification and regression tasks.

2. CONDITIONAL INVARIANCES FOR PROTEINS

The symmetry of an object is the set of transformations that leaves the object invariant. The notion of symmetries, expressed through the action of a group on functions defined on some domain, has served as a fundamental concept of geometric deep learning. In this section, we start by defining the concept of input-dependent conditional symmetries and then tailor these conditional symmetries to be both computationally efficient and useful for representing protein conformations. Group-theoretic preliminaries are presented in Appendix A.2.

Definition 2.1 (Conditionally symmetric-invariant functions). A function f : Ω → R is said to be conditionally symmetric-invariant if

f(t_x · x) = f(x), ∀ t_x ∈ S_x, ∀ x ∈ Ω,

where S_x is a set of transformations unique to element x and t_x : Ω → Ω.

It is easy to see that conditionally invariant functions generalize group-invariant functions, for which S_x ≡ G for all x ∈ Ω, where G is a group. The definition is motivated by the fact that protein representations need not be invariant only to group actions, but to a more general set of protein-specific transformations. We detail this motivation next.

A protein molecule is composed of amino acids (say n amino acids), and each atom in the protein belongs to one amino acid ∈ {1, . . . , n}. Excluding hydrogen atoms, every amino acid contains four atoms known as the backbone atoms (see Figure 1); the remaining atoms belong to the side chain. Traditionally, protein structures are solved by X-ray crystallography or cryo-EM, and the obtained structure is normally considered a unique 3D conformation of the molecule.
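As a toy sketch (my own, not from the paper), one way to obtain a conditionally invariant function is to symmetrize an arbitrary f over the input-dependent set S_x. This yields f(t_x · x) = f(x) under the assumptions stated in the comments: S_x is finite, closed under composition, and identical across the orbit of x.

```python
import numpy as np

def conditionally_invariant(f, transforms_of, x):
    """Symmetrize f over the input-dependent transformation set S_x.

    transforms_of(x) returns the finite set S_x of maps t_x: Omega -> Omega.
    The average is invariant, i.e. g(t(x)) == g(x) for every t in S_x,
    provided S_x is closed under composition and the transformed input has
    the same transformation set as x (S_{t(x)} = S_x).
    """
    return np.mean([f(t(x)) for t in transforms_of(x)], axis=0)
```

With a group-invariant network, `transforms_of` would return the same set for every x; the conditional setting lets it depend on the input, as required for proteins whose viable conformations differ.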
Structures available via the Protein Data Bank (Berman et al., 2000) generally include only a single set of coordinates to describe the 3D structure, which leads to a unique answer being treated as the 'gold standard' in structure prediction, as in protein structure prediction competitions such as CASP (Kryshtafovych et al., 2019). Under these assumptions, protein side-chain conformations are usually assumed to cluster into rotamers, which are rigid conformations represented by discrete side-chain dihedral angles.
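To make the notion of a side-chain dihedral rotation concrete, the sketch below (my own illustration, with hypothetical argument names) rotates the atoms downstream of a bond about the bond axis using Rodrigues' rotation formula, the elementary rigid move that takes one rotamer to another.

```python
import numpy as np

def rotate_about_bond(coords, a, b, moving_idx, angle):
    """Rotate coords[moving_idx] by `angle` (radians) about the bond a-b.

    coords: (N, 3) array of atom positions.
    a, b: indices of the two atoms defining the dihedral's rotation axis.
    moving_idx: indices of the atoms downstream of the bond (the side-chain
    subtree in the protein's directed forest).
    """
    axis = coords[b] - coords[a]
    axis = axis / np.linalg.norm(axis)
    out = coords.copy()
    v = coords[moving_idx] - coords[b]  # positions relative to the pivot atom
    # Rodrigues' formula: v' = v cos(t) + (k x v) sin(t) + k (k.v)(1 - cos(t))
    out[moving_idx] = (coords[b]
                       + v * np.cos(angle)
                       + np.cross(axis, v) * np.sin(angle)
                       + axis * (v @ axis)[:, None] * (1 - np.cos(angle)))
    return out
```

Because the move is a rigid rotation of the subtree, bond lengths and angles within the moved fragment are preserved; only the dihedral angle about the a-b bond changes.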



Figure 1: Magnified view of the side chain of a single generic amino acid (here, with six atoms in the side chain) in a protein molecule. A protein molecule typically contains tens to hundreds of amino acids.

