NEURAL POTTS MODEL

Abstract

We propose the Neural Potts Model objective as an amortized optimization problem. The objective enables training a single model with shared parameters to explicitly model energy landscapes across multiple protein families. Given a protein sequence as input, the model is trained to predict a pairwise coupling matrix for a Potts model energy function describing the local evolutionary landscape of the sequence. Couplings can be predicted for novel sequences. A controlled ablation experiment assessing unsupervised contact prediction on sets of related protein families finds a gain from amortization for low-depth multiple sequence alignments; the result is then confirmed on a database with broad coverage of protein sequences.

1. INTRODUCTION

When two positions in a protein sequence are in spatial contact in the folded three-dimensional structure of the protein, evolution is not free to choose the amino acid at each position independently. This means that the positions co-evolve: when the amino acid at one position varies, the assignment at the contacting site may vary with it. A multiple sequence alignment (MSA) summarizes evolutionary variation by collecting a group of diverse but evolutionarily related sequences. Patterns of variation, including co-evolution, can be observed in the MSA. These patterns are in turn associated with the structure and function of the protein (Göbel et al., 1994). Unsupervised contact prediction aims to detect co-evolutionary patterns in the statistics of the MSA and infer structure from them. The standard method for unsupervised contact prediction fits a Potts model energy function to the MSA (Lapedes et al., 1999; Thomas et al., 2008; Weigt et al., 2009). Various approximations are used in practice, including mean field (Morcos et al., 2011), sparse inverse covariance estimation (Jones et al., 2011), and pseudolikelihood maximization (Balakrishnan et al., 2011; Ekeberg et al., 2013; Kamisetty et al., 2013). To construct the MSA for a given input sequence, a similarity query is performed across a large database to identify related sequences, which are then aligned to each other. Fitting the Potts model to the set of sequences identifies statistical couplings between different sites in the protein, which can be used to infer contacts in the structure (Weigt et al., 2009). Contact prediction performance depends on the depth of the MSA and is reduced when few related sequences are available to fit the model.

In this work we consider fitting many models across many families simultaneously, with parameter sharing across all the families. We introduce this formally as the Neural Potts Model (NPM) objective.
The objective is an amortized optimization problem across sequence families. A Transformer model is trained to predict the parameters of a Potts model energy function defined by the MSA of each input sequence. This approach builds on ideas from the emerging field of protein language models (Alley et al., 2019; Rives et al., 2019; Heinzinger et al., 2019), which proposes to fit a single model with unsupervised learning across many evolutionarily diverse protein sequences. We extend this core idea to train a model to output an explicit energy landscape for every sequence. To evaluate the approach, we focus on the problem setting of unsupervised contact prediction for proteins with low-depth MSAs. Unsupervised structure learning with Potts models performs poorly when few related sequences are available (Jones et al., 2011; Kamisetty et al., 2013; Moult et al., 2016). Since larger protein families are likely to have structures available, the proteins of greatest interest for unsupervised structure prediction are likely to have lower-depth MSAs (Tetchner et al., 2014). This is especially a problem for higher organisms, where there are fewer related genomes (Tetchner et al., 2014). The hope is that for low-depth MSAs, the parameter sharing in the neural model will improve results relative to fitting an independent Potts model to each family.
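To make the idea of amortized optimization across families concrete, here is a minimal numpy sketch. It is an illustrative assumption, not the paper's implementation: a random linear map stands in for the Transformer, and only the per-position fields h are predicted (the full NPM also predicts the couplings J). One shared parameter matrix `theta` is scored against the MSAs of several families, instead of fitting an independent Potts model to each family.

```python
import numpy as np

# Toy sketch of the amortized objective. All names and shapes are
# illustrative assumptions: `theta` stands in for the Transformer's
# shared parameters, and the model predicts only the fields h.

rng = np.random.default_rng(0)
A, L = 4, 6                                      # alphabet size, sequence length
theta = rng.normal(size=(L * A, L * A)) * 0.1    # shared across all families

def one_hot(x):
    e = np.zeros((len(x), A))
    e[np.arange(len(x)), x] = 1.0
    return e.ravel()

def predict_fields(x_query):
    """W_theta(x): map a query sequence to per-position fields h of shape (L, A)."""
    return (theta @ one_hot(x_query)).reshape(L, A)

def family_loss(h, msa):
    """Mean negative log-likelihood of the MSA's sequences under fields h.
    With fields only, positions are independent, so this coincides with
    the (negative log-) pseudolikelihood."""
    logp = h - np.log(np.exp(h).sum(axis=1, keepdims=True))
    return -np.mean([logp[np.arange(L), x].sum() for x in msa])

# Two toy "families", each an (N, L) MSA of integer-coded sequences.
# The amortized loss sums over families while `theta` stays shared,
# unlike one independent Potts fit per family.
families = [rng.integers(0, A, size=(8, L)) for _ in range(2)]
total = sum(family_loss(predict_fields(msa[0]), msa) for msa in families)
print(total)  # a positive scalar to be minimized over theta
```

Minimizing `total` with respect to `theta` (e.g., by gradient descent in an autodiff framework) is the amortized counterpart of running a separate Potts optimization per family.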

We investigate the NPM objective in a controlled ablation experiment on a group of related protein families in PFAM (Finn et al., 2016). In this artificial setting, information can be generalized by the pre-trained shared parameters to improve unsupervised contact prediction on a subset of the MSAs that have been artificially truncated to reduce their number of sequences. We then study the model in the setting of a large dataset without artificial reduction, training the model on MSAs for UniRef50 sequences. In this setting there is also an improvement on average for low-depth MSAs, both for sequences in the training set and for sequences not in the training set.

2. BACKGROUND

Multiple sequence alignments An MSA is a set of aligned protein sequences that are evolutionarily related. MSAs are constructed by retrieving related sequences from a sequence database and aligning the returned sequences using a heuristic. An MSA can be viewed as a matrix where each row is a sequence, and columns contain aligned positions after removing insertions and replacing deletions with gap characters.
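The matrix view of an MSA can be made concrete with a small sketch. This assumes the related sequences have already been retrieved and aligned by an external tool; the toy alphabet encoding and the four aligned rows below are illustrative, not real protein data.

```python
import numpy as np

# Minimal sketch: an MSA as an (N, L) integer matrix, one row per
# aligned sequence, one column per alignment position. '-' is the
# gap character left by a deletion; insertions relative to the
# alignment columns are assumed to have been removed already.

ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"  # index 0 is the gap
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

aligned = [
    "MKV-LA",   # query row
    "MRV-LA",
    "MKVGLA",
    "M-VGLS",
]

msa = np.array([[AA_TO_IDX[c] for c in row] for row in aligned])
print(msa.shape)  # (4, 6): 4 sequences, 6 aligned columns
```

Column-wise statistics of such a matrix (single-site frequencies, pairwise co-occurrence counts) are exactly what a Potts model fit consumes.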

Potts model

The generalized Potts model defines a Gibbs distribution over a protein sequence (x_1, ..., x_L) of length L through the negative energy function:

-E(x; W) = \sum_i h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j)  (1)

which defines potentials h_i for each position in the sequence and couplings J_{ij} for every pair of positions. The parameters of the model are W = {h, J}, the set of fields and couplings respectively. The distribution p(x; W) is obtained by normalization as exp{-E(x; W)} / Z(W). Since the normalization constant Z(W) is intractable, pseudolikelihood is commonly used to fit the parameters (Balakrishnan et al., 2011; Ekeberg et al., 2013). Pseudolikelihood approximates the likelihood of a sequence x by a product of conditional distributions; its negative logarithm is \ell_{PL}(x; W) = -\sum_i \log p(x_i | x_{-i}; W). To estimate the Potts model, we take the expectation over an MSA M:

L_{PL}(W) = E_{x \sim M}[\ell_{PL}(x; W)]  (2)

In practice, we have a finite set of sequences M in the MSA with which to estimate Eq. (2). An L2 regularization term \rho(W) = \lambda_J ||J||^2 + \lambda_h ||h||^2 is added, and sequences are reweighted to account for redundancy (Morcos et al., 2011). We write the regularized finite-sample estimator as:

\hat{L}_{PL}(W) = (1 / M_eff) \sum_{m=1}^{M} w_m \ell_{PL}(x^{(m)}; W) + \rho(W)  (3)

which sums over the M sequences of the finite MSA M, weighted by w_m summing collectively to M_eff. The finite-sample estimate of the parameters \hat{W}^* is obtained by minimizing \hat{L}_{PL}.

Idealized MSA Notice how in Eq. (2) we idealized the MSA M as a distribution, defined by the protein family. We consider the set of sequences actually retrieved in the MSA M in Eq. (3) as a finite sample from this underlying idealized distribution. For some protein families this sample will contain more information than for others, depending on which sequences are present in the database. We will refer to W^* as a hypothetical idealized estimate of the parameters, to explain how the Neural Potts Model can improve on the finite-sample estimate \hat{W}^* for low-depth MSAs.
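As a concrete illustration, the sketch below evaluates the negative energy -E(x) = \sum_i h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j) and the log-pseudolikelihood \sum_i \log p(x_i | x_{-i}; W) for a toy sequence. The alphabet size, sequence length, and random parameters are illustrative assumptions; no regularization or sequence reweighting is included.

```python
import numpy as np

# Toy Potts model: alphabet of A states, sequence length L.
# Shapes (illustrative): h is (L, A); J is (L, L, A, A), with
# J[i, j] used only for i < j.
rng = np.random.default_rng(0)
L, A = 5, 4
h = rng.normal(size=(L, A))
J = rng.normal(size=(L, L, A, A))

def neg_energy(x, h, J):
    """-E(x) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j)."""
    s = sum(h[i, x[i]] for i in range(len(x)))
    s += sum(J[i, j, x[i], x[j]]
             for i in range(len(x)) for j in range(i + 1, len(x)))
    return s

def pseudolikelihood(x, h, J):
    """sum_i log p(x_i | x_{-i}; W): each conditional is a softmax
    over the A possible states at position i, the rest of x fixed."""
    total = 0.0
    for i in range(len(x)):
        x2 = x.copy()
        logits = np.empty(A)
        for a in range(A):
            x2[i] = a
            logits[a] = neg_energy(x2, h, J)
        m = logits.max()
        total += logits[x[i]] - (m + np.log(np.exp(logits - m).sum()))
    return total

x = rng.integers(0, A, size=L)
print(pseudolikelihood(x, h, J))  # a log-probability, so <= 0
```

A pseudolikelihood fit would maximize this quantity (equivalently, minimize its negative plus the L2 penalty) over h and J, summed over the weighted sequences of the MSA.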



Figure 1: (a) Standard Potts model requires constructing an MSA and optimizing parameters W . (b) Neural Potts Model (NPM) predicts W in a single feedforward pass from a single sequence.

