A GRAPH NEURAL NETWORK APPROACH TO AUTOMATED MODEL BUILDING IN CRYO-EM MAPS

Abstract

Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. Along with knowledge about the imaged molecules, cryo-EM maps allow de novo atomic modeling, which is typically done through a laborious manual process. Taking inspiration from recent advances in machine learning applications to protein structure prediction, we propose a graph neural network (GNN) approach for the automated model building of proteins in cryo-EM maps. The GNN acts on a graph with nodes assigned to individual amino acids and edges representing the protein chain. Combining information from the voxel-based cryo-EM data, the amino acid sequence data, and prior knowledge about protein geometries, the GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. Application to 28 test cases shows that our approach outperforms the state of the art and approximates manual building for cryo-EM maps with resolutions better than 3.5 Å.¹

1. INTRODUCTION

Following rapid developments in microscopy hardware and image processing software, cryo-EM structure determination of biological macromolecules is now possible to atomic resolution for favourable samples (Nakane et al., 2020; Yip et al., 2020). For many other samples, such as large multi-component complexes and membrane proteins, resolutions around 3 Å are typical (Cheng, 2018). Transmission electron microscopy images are taken of many copies of the same molecules, which are frozen in a thin layer of vitreous ice. Dedicated software, like RELION (Scheres, 2012) or cryoSPARC (Punjani et al., 2017), implements iterative optimization algorithms to retrieve the orientation of each molecule and perform 3D reconstruction to obtain a voxel-based map of the underlying molecular structure. Provided the cryo-EM map is of sufficient resolution, it is interpreted in terms of an atomic model of the corresponding molecules. Many samples contain only proteins; other samples also contain other biological molecules, like lipids or nucleic acids. Proteins are linear chains of amino acids or residues. There are twenty different canonical amino acids that make up proteins. All of these amino acids have four heavy (non-hydrogen) atoms that make up the protein's main chain. The different amino acids have different numbers, types, and geometrical arrangements of their side-chain atoms. The smallest amino acid, glycine, has no heavy side-chain atoms; the largest amino acid, tryptophan, has ten heavy side-chain atoms. Typical proteins range in size from tens to more than a thousand residues. Typically, the electron microscopist knows which protein sequences are present in the sample. The task at hand is to build the atomic model, which identifies the positions of all atoms for all proteins that are present in the cryo-EM map. For each residue, there are two rotational degrees of freedom in the conformation of its main chain, the Φ and Ψ angles.
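As an illustration of the two main-chain degrees of freedom, the following minimal sketch computes Φ and Ψ as signed dihedral angles from atomic coordinates. It relies on the standard definitions (Φ is the dihedral C(i-1)-N(i)-Cα(i)-C(i); Ψ is the dihedral N(i)-Cα(i)-C(i)-N(i+1)). The function names `dihedral`, `phi`, and `psi` are hypothetical and are not taken from the software described in this paper.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond.

    Illustrative helper, not part of the paper's code. Points are
    (x, y, z) tuples; the four points must not be collinear.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    y = _dot(b1, _cross(n1, n2))
    x = math.sqrt(_dot(b1, b1)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Phi of residue i: C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Psi of residue i: N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)


# Toy geometry (not a real residue): a quarter turn about the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # about +90 degrees
```

A planar cis arrangement of the four atoms gives 0° and a planar trans arrangement gives ±180° under this convention.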
Distinct orientations of the side chains provide additional conformational possibilities, the number of which depends on the type of amino acid (Figure 1). Atomic model building in cryo-EM maps is typically done manually using 3D visualisation software (e.g. Emsley et al., 2010; Pettersen et al., 2021), followed by refinement procedures that optimize the fit of the models in the map (e.g. Murshudov et al., 2011; Croll, 2018; Liebschner et al., 2019). Often, in areas of weak density in the map, one cannot discern the amino acid identity of residues from the map alone, and sequence information has to be used to make an accurate assessment. Manually building a reliable atomic model de novo in the reconstructed cryo-EM map is considered to be difficult for maps with resolutions worse than 4 Å. Although the task is more straightforward for maps with resolutions better than 3 Å, it still typically requires large amounts of time and a high level of expertise. Machine learning has recently achieved a major step forward in structure prediction for individual proteins (Jumper et al., 2021; Baek et al., 2021). In these approaches, the sequence information of proteins and their evolutionarily related homologues is used to predict their atomic structure without the use of experimental data. In addition, protein language models, which are trained in an unsupervised fashion on the amino acid sequences of many proteins, have also provided useful results in protein structure prediction (Lin et al., 2022; Wu et al., 2022a). Although these techniques are not yet capable of reliably predicting structures of the larger complexes that are typically studied by cryo-EM, their success for individual proteins inspired us to explore similar approaches for automated model building in cryo-EM maps.
In this paper, we present a single integrated GNN that combines the voxel-based information from the cryo-EM map with information from the protein sequence through a protein language model, and information from the topology of the graph through invariant point attention (IPA) (Jumper et al., 2021). For 28 test cases, we demonstrate that our approach approximates the accuracy of manual model building for maps with resolutions better than 3.5 Å.
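To make the graph representation concrete, the sketch below shows one possible data structure with one node per residue and edges between consecutive residues along the chain. This is an assumption-laden illustration only: the names `ResidueNode`, `AMINO_ACIDS`, and `chain_edges`, the choice of a Cα position as the node coordinate, and the uniform initial class probabilities are all hypothetical and do not describe the actual implementation.

```python
from dataclasses import dataclass, field

# One-letter codes of the twenty canonical amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


@dataclass
class ResidueNode:
    """Hypothetical graph node: one residue placed in the cryo-EM map."""

    ca_position: tuple  # assumed (x, y, z) coordinate in Å
    # Per-node class probabilities over the 20 amino acids (uniform start).
    aa_probs: list = field(default_factory=lambda: [1.0 / 20] * 20)

    def predicted_aa(self):
        """Return the one-letter code with the highest probability."""
        best = max(range(len(AMINO_ACIDS)), key=lambda i: self.aa_probs[i])
        return AMINO_ACIDS[best]


def chain_edges(num_nodes):
    """Edges (i, i + 1) linking consecutive residues along one chain."""
    return [(i, i + 1) for i in range(num_nodes - 1)]


# Three placeholder nodes (coordinates are made up for illustration).
nodes = [ResidueNode((0.0, 0.0, 3.8 * i)) for i in range(3)]
edges = chain_edges(len(nodes))
```

In a sketch like this, a network would update `ca_position` and `aa_probs` for each node, corresponding to the geometry refinement and amino acid classification described in the text.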

2. PRIOR WORK

Automated model building. Automated approaches for atomic modeling in the related experimental technique of X-ray crystallography have existed for many years (for example, Perrakis et al., 1999; Cowtan, 2006; Terwilliger et al., 2008). Some of these approaches have also been applied to cryo-EM maps. For example, the PHENIX package builds models that are on average 47% complete for cryo-EM maps with resolutions worse than 3 Å (Terwilliger et al., 2018). For similar maps, MAINMAST, an approach that was designed to build Cα main-chain traces in cryo-EM maps, often produces models with root mean squared deviations (RMSDs) in the range of tens of Å (Terashi & Kihara, 2018). Relatively incomplete models, with large residuals, have limited the impact of these techniques on automated model building in the cryo-EM field thus far. More recently, Deeptracer (Pfab et al., 2021), the first deep-learning approach for automated atomic modeling in cryo-EM maps, was reported to outperform these earlier approaches. Deeptracer uses U-Nets (Ronneberger et al., 2015) to construct an atomic model de novo in the cryo-EM map. In contrast to our work, Deeptracer does not integrate the sequence information with the U-Net, and it does not use a graph representation of the protein chain during model refinement. Instead, Deeptracer treats the entire problem as a segmentation and classification problem. Thereby, it also does



¹ Code and weights are open-source and can be accessed at https://github.com/3dem/model-angelo



Figure 1: (A) shows a peptide backbone where the chain is ordered from left to right. Arrows mark the peptide bonds between the C atom of one residue and the N atom of the next. (B) shows six amino acids, pointing out the differences in the side chains, marked by the outline. (C) shows the Φ and Ψ angles of the backbone and the additional rotatable bond of the side chain for threonine.

