A GRAPH NEURAL NETWORK APPROACH TO AUTOMATED MODEL BUILDING IN CRYO-EM MAPS

Abstract

Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. Along with knowledge about the imaged molecules, cryo-EM maps allow de novo atomic modeling, which is typically done through a laborious manual process. Taking inspiration from recent advances in machine learning applications to protein structure prediction, we propose a graph neural network (GNN) approach for the automated model building of proteins in cryo-EM maps. The GNN acts on a graph with nodes assigned to individual amino acids and edges representing the protein chain. Combining information from the voxel-based cryo-EM data, the amino acid sequence data, and prior knowledge about protein geometries, the GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. Application to 28 test cases shows that our approach outperforms the state-of-the-art and approximates manual building for cryo-EM maps with resolutions better than 3.5 Å.

1. INTRODUCTION

Following rapid developments in microscopy hardware and image processing software, cryo-EM structure determination of biological macromolecules is now possible to atomic resolution for favourable samples (Nakane et al., 2020; Yip et al., 2020). For many other samples, such as large multi-component complexes and membrane proteins, resolutions around 3 Å are typical (Cheng, 2018). Transmission electron microscopy images are taken of many copies of the same molecules, which are frozen in a thin layer of vitreous ice. Dedicated software, such as RELION (Scheres, 2012) or cryoSPARC (Punjani et al., 2017), implements iterative optimization algorithms to retrieve the orientation of each molecule and perform 3D reconstruction to obtain a voxel-based map of the underlying molecular structure. Provided the cryo-EM map is of sufficient resolution, it is interpreted in terms of an atomic model of the corresponding molecules.

Many samples contain only proteins; other samples also contain other biological molecules, like lipids or nucleic acids. Proteins are linear chains of amino acids, or residues. There are twenty different canonical amino acids that make up proteins. All of these amino acids have four heavy (non-hydrogen) atoms that make up the protein's main chain. The different amino acids have different numbers, types, and geometrical arrangements of their side-chain atoms. The smallest amino acid, glycine, has no heavy side-chain atoms; the largest amino acid, tryptophan, has ten heavy side-chain atoms. Typical proteins range in size from tens to more than a thousand residues. The electron microscopist usually knows which protein sequences are present in the sample. The task at hand is to build the atomic model, which identifies the positions of all atoms for all proteins that are present in the cryo-EM map. For each residue, there are two rotational degrees of freedom in the conformation of its main chain, the Φ and Ψ angles. Distinct orientations of the side chains provide additional conformational possibilities, the number of which depends on the type of amino acid (figure 1).
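To make the two main-chain degrees of freedom concrete, the following minimal sketch computes a dihedral (torsion) angle from four atom positions. It assumes the standard definitions, in which Φ is the dihedral over C(i-1), N(i), CA(i), C(i) and Ψ is the dihedral over N(i), CA(i), C(i), N(i+1). The function names `dihedral_deg`, `phi_deg`, and `psi_deg` are illustrative and are not taken from the software described in this paper.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    """Cross product a x b of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1 -> p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalise the central bond vector.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # The angle between the two projections is the dihedral angle.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_deg(c_prev, n, ca, c):
    """Main-chain phi: dihedral over C(i-1), N(i), CA(i), C(i)."""
    return dihedral_deg(c_prev, n, ca, c)


def psi_deg(n, ca, c, n_next):
    """Main-chain psi: dihedral over N(i), CA(i), C(i), N(i+1)."""
    return dihedral_deg(n, ca, c, n_next)


# Toy coordinates (made up for illustration, not real protein data).
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter_turn = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

In this toy example the three calls correspond to an eclipsed (cis) arrangement, a quarter turn, and a trans arrangement of the outer atoms about the central bond. Applied to real main-chain coordinates, `phi_deg` and `psi_deg` would give the two angles referred to above.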



Code and weights are open-source and can be accessed at https://github.com/3dem/model-angelo

