MATCHING RECEPTOR TO ODORANT WITH PROTEIN LANGUAGE AND GRAPH NEURAL NETWORKS

Abstract

Odor perception in mammals is triggered by interactions between volatile organic compounds and a subset of hundreds of proteins called olfactory receptors (ORs). Molecules activate these receptors in a complex combinatorial coding that allows mammals to discriminate a vast number of chemical stimuli. Recently, ORs have gained attention as new therapeutic targets following the discovery of their involvement in other physiological processes and diseases. To date, predicting molecule-induced activation for ORs is highly challenging since 43% of ORs have no identified active compound. In this work, we combine the [CLS] token from protBERT with a molecular graph and propose a tailored GNN architecture incorporating inductive biases from protein-molecule binding. We abstract the biological process of protein-molecule activation as the injection of a molecule into a protein-specific environment. On a newly gathered dataset of 46 700 OR-molecule pairs, this model outperforms state-of-the-art models on drug-target interaction prediction as well as standard GNN baselines. Moreover, by incorporating non-bonded interactions, the model is able to work with mixtures of compounds. Finally, our predictions reveal a similar activation pattern for molecules within a given odor family, which is in agreement with the theory of combinatorial coding in olfaction.

1. INTRODUCTION

The mammalian sense of smell constantly provides information about the composition of the volatile chemical environment and is able to discriminate thousands of different molecules. At the atomic scale, volatile organic compounds are recognized by specific interactions with protein receptors expressed at the surface of olfactory neurons (Buck & Axel, 1991). The mammalian epithelium expresses hundreds of different olfactory receptors (ORs), belonging to the G protein-coupled receptors (GPCRs), which constitute the largest known multigene family (Niimura & Nei, 2003). The recognition of odorants by ORs is based on the complementarity of structures and on hydrophobic or van der Waals interactions, which leads to low molecular affinity (Katada et al., 2005). With the exception of a few conserved amino acids, the sequences of ORs show little identity. In particular, the ligand-binding pocket has hypervariable residues (Pilpel & Lancet, 1999) that are relatively well conserved between orthologs. This property gives ORs the ability to bind a wide variety of molecules that differ in structure, size, or chemical properties. Odorants are recognized according to a combinatorial code of activation (Malnic et al., 1999). Each odorant is recognized by several ORs, whereas an individual OR can bind several odorants with distinct affinities and specificities (Zhao et al., 1998). This combinatorial code is sensitive to subtle modifications, so the response of a single receptor can have a major influence on smell perception. Even a small sequence modification can affect odorant responsiveness (Keller et al., 2007; Mainland et al., 2014). Conversely, structural and functional modifications of an odorant can abolish the interaction with a specific receptor (Katada et al., 2005), and even lead to a different smell perception (Sell, 2006). So far, the combinatorial code of the majority of odorants remains unknown.
Identifying the recognition spectrum of each OR is therefore essential to decipher the mechanisms of the olfactory system and subsequently build models capable of cracking the combinatorial code of activation. There is only a limited number of models designed to match ligands and ORs. Namely, Kowalewski & Ray (2020) follow a molecule-oriented approach and predict agonists for a subset of 34 ORs (24 wild types and 10 variants) by representing molecules via fingerprints (Morgan, 1965; Klekota & Roth, 2008) and building an individual SVM model for each OR. On the other hand, Cong et al. (2022) focus on receptors and address the more complex problem of predicting active ORs for 4 given molecules. In their random forest model, each amino acid of the protein sequence is described by 3 physico-chemical properties and the molecules by a subset of Dragon descriptors (Mauri et al., 2006). In a more general approach, Gupta et al. (2021) consider any OR-molecule pair and use a BiLSTM (Graves & Schmidhuber, 2005) to predict the binding of a molecule, represented by a SMILES string, to an OR sequence. In contrast, in this work we use a graph and a [CLS] token embedding from protBERT (Elnaggar et al., 2021) to represent molecules and receptors, respectively. We abstract receptor-molecule binding as the injection of a molecule into a protein-specific environment. This is achieved by using the molecular topology as a layout for the message passing process and copying the protein representation to each node of the molecular graph. As a result, the redundancy of protein information enables local processing of the "protein environment" and achieves better performance than the common strategy of processing the receptor and the molecule in parallel. This abstraction leads to a graph whose number of nodes depends only on the size of the molecule.
Molecules with flexible moieties can undergo conformational changes upon binding to maximize interactions with the receptor binding cavity. This structural adaptation modifies the strength of internal non-bonded forces. However, standard GNN architectures are restricted to the molecular topology. Thus, we have built a tailored GNN architecture combining local interactions of bonded atoms, as done by standard GNNs, with multi-head attention, giving the model the ability to incorporate interactions between any pair of atoms. We show that this architecture outperforms other baselines as well as previous work on olfactory receptor-molecule activation prediction. Finally, we found a relationship between human odor perception and model predictions, strengthening the biological relevance of the model. The results for humans are in full agreement with the experimental work done by Nara et al. (2011). By analyzing the predictions for human ORs, we observe that the combinatorial codes exhibit large diversity. The OR repertoire contains mostly narrowly tuned receptors, along with several broadly tuned ones. The results also highlight the existence of odor-specific receptors, but most odors are coded in a complex activation pattern.
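
To make the "injection" abstraction concrete, the following is a minimal sketch in plain Python, not the architecture evaluated in this work. It copies a protein vector onto every atom node and then applies one update that adds a bonded-neighbor average (the local step restricted to molecular topology) to a multi-head scaled dot-product attention over all atom pairs (the global step). All function names, the toy 3-atom molecule, the feature sizes, and the use of identity projections in place of learned query/key/value weights are assumptions made for illustration only.

```python
import math

def inject_protein(atom_feats, protein_vec):
    """Copy the (hypothetical) protein vector onto every atom node."""
    return [list(a) + list(protein_vec) for a in atom_feats]

def neighbor_mean(h, bonds):
    """Local step: average features over bonded neighbors only."""
    n, d = len(h), len(h[0])
    nbrs = [[] for _ in range(n)]
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    out = []
    for i in range(n):
        if not nbrs[i]:
            out.append([0.0] * d)  # isolated atom: no bonded messages
        else:
            out.append([sum(h[j][k] for j in nbrs[i]) / len(nbrs[i])
                        for k in range(d)])
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(h, num_heads):
    """Global step: every atom attends to every atom, bonded or not.
    Assumption: identity projections stand in for learned Q/K/V weights."""
    n, d = len(h), len(h[0])
    assert d % num_heads == 0
    dh = d // num_heads
    out = [[0.0] * d for _ in range(n)]
    for head in range(num_heads):
        lo, hi = head * dh, (head + 1) * dh
        for i in range(n):
            scores = [sum(h[i][k] * h[j][k] for k in range(lo, hi)) / math.sqrt(dh)
                      for j in range(n)]
            w = softmax(scores)
            for k in range(lo, hi):
                out[i][k] = sum(w[j] * h[j][k] for j in range(n))
    return out

def layer(h, bonds, num_heads=2):
    """One update: bonded-neighbor mean plus all-pairs attention."""
    local = neighbor_mean(h, bonds)
    glob = multi_head_attention(h, num_heads)
    return [[l + g for l, g in zip(lrow, grow)]
            for lrow, grow in zip(local, glob)]

# Toy input: a 3-atom chain 0-1-2, 2 atom features, 2-dim protein vector.
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
protein = [0.5, -0.5]
bonds = [(0, 1), (1, 2)]

h0 = inject_protein(atoms, protein)   # 3 nodes, 4 features each
h1 = layer(h0, bonds, num_heads=2)    # same shape after one update
```

In this sketch the number of nodes equals the number of atoms regardless of protein length, and the attention step lets atoms exchange information even when no bond path connects them, which is the property that would allow several molecules of a mixture to share one graph.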

2.1. PROTEIN LANGUAGE MODELS

Recently, protein language models have emerged as unsupervised structure learners (Rao et al., 2021b; Vig et al., 2021; Rives et al., 2021), making it possible to extract abstract vector representations of proteins. As in natural language processing (NLP), large models with millions of parameters (e.g., BERT (Devlin et al., 2019)) are trained on vast databases of amino acid sequences (Steinegger et al., 2019; Steinegger & Söding, 2018; UniProt Consortium, 2019; Suzek et al., 2007). Rao et al. (2019) trained and evaluated various NLP models on a set of structurally relevant tasks. Elnaggar et al. (2021) went further and trained a series of powerful NLP models ranging from 200M to 11B parameters. The recently proposed MSA Transformer (Rao et al., 2021a) exploits evolutionary relationships by using a Multiple Sequence Alignment (MSA) as input rather than a single protein sequence at a time. AlphaFold2 (Jumper et al., 2021; Evans et al., 2021) extends the idea of using MSAs and combines them with an experimentally obtained protein template in an end-to-end model trained on supervised structure prediction.
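
As a rough illustration of how a BERT-style protein model yields one vector per protein, the sketch below prepares a raw sequence as space-separated residues, wraps it in [CLS] and [SEP] tokens, and reads the hidden state at position 0. It assumes the preprocessing convention published with ProtBERT (rare residues U, Z, O, B mapped to X). The helper names and the toy sequence are hypothetical, and no model is actually run.

```python
import re

def prepare_protbert_input(sequence):
    """Space-separate residues and map rare residues (U, Z, O, B) to X."""
    seq = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(seq)

def with_special_tokens(spaced):
    """Mimic BERT-style wrapping: [CLS] first, [SEP] last."""
    return ["[CLS]"] + spaced.split() + ["[SEP]"]

def cls_embedding(hidden_states):
    """Select the vector at position 0, i.e. the [CLS] slot.
    `hidden_states` is assumed to hold one vector per token."""
    return hidden_states[0]

# Toy sequence for illustration only (not a real olfactory receptor).
spaced = prepare_protbert_input("MAETCZAO")
tokens = with_special_tokens(spaced)

# Stand-in for model output: one 2-dim vector per token.
fake_hidden = [[float(i), -float(i)] for i in range(len(tokens))]
protein_vec = cls_embedding(fake_hidden)
```

Because the [CLS] vector has a fixed size whatever the sequence length, it can serve as a single protein descriptor downstream.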

2.2. GRAPH NEURAL NETWORKS

In recent years, graph neural networks (Kipf & Welling, 2016; Gilmer et al., 2017; Simonovsky & Komodakis, 2017; Veličković et al., 2018; Wang et al., 2018; Zhou et al., 2018; Battaglia et al., 2018) have grown rapidly in popularity and received considerable attention in various domains such as drug design (Torng & Altman, 2019), physics (Shlomi et al., 2021), and chemistry (Gilmer et al., 2017; Yang et al., 2019; Gasteiger et al., 2020b;a). Chemistry is a particularly promising field for GNN applications since a molecule can be naturally represented as a graph G = {V, E}, where V is the set

