MULTI-SCALE SINUSOIDAL EMBEDDINGS ENABLE LEARNING ON HIGH RESOLUTION MASS SPECTROMETRY DATA

Abstract

Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces data with high sensitivity and parts-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We also investigate the task of chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average R² of 80% for novel compounds can be achieved across 10 chemical properties prioritized by medicinal chemists. We vary the resolution of the input spectra directly by using different floating point representations of the MS2 data, and show that the resulting sinusoidal embeddings are able to learn from the high resolution portion of the input MS2 data. We apply dimensionality reduction to the embeddings that result from input masses of different resolution to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data.

1. INTRODUCTION

Metabolomics is the study of the small molecule (<1,000 Daltons) contents of complex biological samples. Tandem Mass Spectrometry (MS/MS), in conjunction with chromatography, is one of the most commonly used tools in metabolomics. Tandem Mass Spectrometry works by measuring the masses of molecules and their constituent fragments at very high resolution. While MS/MS techniques are highly sensitive and precise, inferring the identity of the molecules and their properties from the resulting mass spectra is commonly regarded as one of metabolomics' primary bottlenecks (Dunn et al., 2013). Improved tools for these tasks will impact applications across many areas of science, including disease diagnostics, characterization of disease pathways, development of new agrochemicals, improved forensics analysis, and the discovery of new drugs (Zhang et al., 2020).

Profiling unknown molecules with mass spectrometry consists of several steps. First, molecules of interest are ionized and separated by their mass to charge ratio (m/z), resulting in the MS1 spectrum. Then, individual "precursor" ions are fragmented, and the m/z's of the fragments are recorded in the same manner. The resulting spectrum contains the m/z's and intensities (together, the "peaks") of all resulting fragments, and is called the MS2 spectrum. See Glish & Vachet (2003).

In recent years, several machine learning methods have been developed to identify the structures and properties of small molecules from their mass spectra. These approaches (Huber et al., 2021a;b; Kutuzova et al., 2021; Litsa et al., 2021; Shrivastava et al., 2021; van Der Hooft et al., 2016) have historically discretized m/z (via tokenization or binning). However, the m/z values obtained in modern mass spectrometry experiments are collected with parts-per-million levels of resolution. There are three critical reasons to expect that modeling m/z values as numeric quantities, rather than as discretized values, is the appropriate technique.
First, we know that relevant chemical information is present at the millidalton level (Jones et al., 2004; Pourshahian & Limbach, 2008), and discretization schemes typically strip this information away. Second, mass differences between peaks represent fragmentation patterns, and are therefore relevant to understanding the molecule. Learning to utilize this information from tokens is exceedingly difficult. Finally, discretization is susceptible to edge effects, where two nearly identical masses can map to different bins if they lie close to the edge of a bin. In light of these considerations, we model m/z as a numerical value using sinusoidal embeddings, which we hypothesize will enable us to capture information across many orders of magnitude.

In this work, we apply a numerical representation of m/z that uses sinusoidal embeddings across multiple scales to retain the information content of MS2 data across its entire mass resolution. To demonstrate the ability of these embeddings to enable learning at the high resolution of MS2 experimental data, we apply them to a search task and 10 regression tasks.

Our first test task is spectral library search (Stein & Scott, 1994). In spectral library search, spectra from unknown compounds are compared to spectra in a database to find matches using a similarity function over pairs of spectra. This task is the primary method used in standard metabolomics analyses, but is challenging because spectra for a compound vary widely with experimental conditions. We find that a similarity function based on sinusoidal embeddings achieves state-of-the-art results both for finding the exact compound and for finding close structural analogs, which is useful for compounds not contained in spectral databases.

We further investigate predicting chemical properties relevant to drug discovery from MS2 data and apply the same modeling approach to this task.
We achieve 80% average R² for out-of-sample molecules across 10 properties, which is high enough to enable first-pass filtering and selection of candidate drug molecules in high-throughput experiments based solely on spectral data. In each task, using sinusoidal embeddings results in a new state of the art. The confirmation of across-task improvement provides evidence that the embeddings are a general improvement, rather than a task-specific one.

To determine whether these results are due to learning from the high resolution portion of the data, we experiment with inputting MS2 data cast to half precision floating point numbers, and show that the performance noticeably degrades relative to double precision. Finally, we visualize embeddings generated with MS2 inputs of varied floating point precision using UMAP (McInnes et al., 2018) projections and show that non-trivial high dimensional structure only emerges with sufficiently high precision input. Taken together, these results are the first clear evidence that sinusoidal embeddings enable effective learning from high mass resolution MS2 data across tasks in metabolomics.
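
As an illustration of the approach outlined in this introduction, the following minimal Python sketch embeds m/z values as sine and cosine pairs over geometrically spaced wavelengths, and shows how casting the input to half precision can erase millidalton differences before any embedding is computed. The function name, the embedding dimension, the wavelength range (1e-3 to 1e4 Da), and the example masses are assumptions chosen for illustration; they are not the settings used in this work.

```python
import numpy as np

def sinusoidal_embedding(mz, dim=64, min_wavelength=1e-3, max_wavelength=1e4):
    """Embed m/z values as sin/cos pairs over geometrically spaced wavelengths.

    All defaults are illustrative assumptions, not values from the paper.
    """
    mz = np.asarray(mz, dtype=np.float64).reshape(-1, 1)
    wavelengths = np.geomspace(min_wavelength, max_wavelength, dim // 2)
    angles = 2.0 * np.pi * mz / wavelengths  # shape (n_peaks, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Two hypothetical fragment masses that differ by 3 millidaltons.
mz64 = np.array([180.010, 180.013], dtype=np.float64)
mz16 = mz64.astype(np.float16)  # near 180, float16 values are spaced 0.125 apart

emb64 = sinusoidal_embedding(mz64)
emb16 = sinusoidal_embedding(mz16)

# Largest per-dimension gap between the embeddings of the two peaks.
gap64 = np.abs(emb64[0] - emb64[1]).max()  # large at the short wavelengths
gap16 = np.abs(emb16[0] - emb16[1]).max()  # zero: both inputs were rounded to 180.0
```

In this sketch the short wavelengths separate the two peaks only when the input keeps its full precision, while the long wavelengths still encode the coarse mass in both cases. This is the sense in which a multi-scale embedding can carry information across many orders of magnitude.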

1.1. RELATED WORK

Modeling numerical data in terms of sinusoidal functions has a long history in many scientific fields. In machine learning, sinusoidal embeddings are most commonly used to encode the discrete positions of natural language token inputs in transformer models, which are otherwise position-agnostic (Vaswani et al., 2017). Other work has used sinusoidal embeddings for multi-dimensional positional encoding in image recognition (Li et al., 2021a). In mass spectrometry, sinusoidal embeddings have been used in proteomics for inferring protein sequences from mass values, either with (Qiao et al., 2019) or without (Yilmaz et al., 2022) an initial mass binning step, but without exploring their role in model performance or comparing to alternative approaches. These techniques have never been applied in the domain of metabolomics. Metabolomics differs from proteomics in that the molecules of interest are 2-3 orders of magnitude lower in mass and are graph structured rather than sequential as in proteins. Consequently, the tasks and challenges of modeling small molecules diverge sharply from those in proteomics.

For modeling m/z values in the field of metabolomics, previous machine learning models have relied primarily on discretization of the continuous mass inputs. This is usually accomplished by binning m/z values into fixed length vectors with peak intensity as the value for each element. Various authors have used binned representations of spectra for spectral library search (Huber et al., 2021b), unsupervised topic modeling (van Der Hooft et al., 2016), and molecule identification (Kutuzova et al., 2021; Litsa et al., 2021). Alternatively, masses have been tokenized via rounding for tasks such as unsupervised spectral similarity (Huber et al., 2021a) and molecule prediction from synthetic data (Shrivastava et al., 2021).
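
To make the two discretization schemes described above concrete, the following Python sketch bins a spectrum into a fixed length intensity vector and, separately, tokenizes masses by rounding. The function names, the bin width, the mass range, the choice to sum intensities within a bin, and the token format are all hypothetical choices made for illustration; the cited works each use their own settings.

```python
import numpy as np

def bin_spectrum(mz, intensity, bin_width=1.0, max_mz=1000.0):
    """Sum peak intensities into a fixed-length vector of m/z bins."""
    mz = np.asarray(mz, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float64)
    n_bins = int(np.ceil(max_mz / bin_width))
    idx = np.floor(mz / bin_width).astype(int)
    keep = (idx >= 0) & (idx < n_bins)          # drop peaks outside the range
    vec = np.zeros(n_bins)
    np.add.at(vec, idx[keep], intensity[keep])  # accumulate peaks sharing a bin
    return vec

def tokenize_spectrum(mz, decimals=2):
    """Turn each m/z into a string token by rounding (hypothetical format)."""
    return [f"mz_{m:.{decimals}f}" for m in mz]

# Edge effect: two peaks 1 mDa apart fall into different bins,
edge = bin_spectrum([100.9995, 101.0005], [1.0, 1.0])
# while two peaks 0.8 Da apart are merged into a single bin.
merged = bin_spectrum([100.1, 100.9], [1.0, 0.5])

# Rounding behaves similarly: 0.2 mDa apart, yet the tokens differ.
tokens = tokenize_spectrum([100.0049, 100.0051])
```

In both cases the discrete representation treats nearly identical masses as unrelated symbols while discarding differences that fall inside one bin or one rounding step.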

