SIGNAL TO SEQUENCE ATTENTION-BASED MULTIPLE INSTANCE NETWORK FOR SEGMENTATION FREE INFERENCE OF RNA MODIFICATIONS

Abstract

Direct RNA sequencing works by passing long RNA molecules through tiny pores, generating an electrical current, called a squiggle, that is interpreted as a series of RNA nucleotides using deep learning algorithms. The platform has also facilitated computational detection of RNA modifications via machine learning and statistical approaches, because modified nucleotides cause a detectable shift in the current as they pass through the pores. Nevertheless, since modifications occur at only a handful of positions along each molecule, existing techniques require segmentation of the long squiggle in order to filter out irrelevant signals, and this step incurs large computational and storage overhead. Inspired by recent work in vector similarity search, we introduce a segmentation-free approach that uses scaled dot-product attention to perform implicit segmentation and feature extraction of the raw signals that correspond to sites of interest. We demonstrate the feasibility of our approach by achieving a significant speedup while maintaining competitive performance in m6A detection against existing state-of-the-art methods.

1. INTRODUCTION

RNA modifications have been known since the 1950s (Cohn & Volkin, 1951; Kemp & Allen, 1958; Davis & Allen, 1957) and have been found to play a prominent role in a wide range of biological processes (Xu et al., 2017; Yankova et al., 2021; Nombela et al., 2021). Several experimental methods exist to detect these modifications, most prominently N6-methyladenosine (m6A) (Meyer et al., 2012; Dominissini et al., 2012; Chen et al., 2015; Ke et al., 2015; Molinie et al., 2016; Linder et al., 2015; Koh et al., 2019; Dierks et al., 2021), pseudouridine (ψ) (Schwartz et al., 2014a; Lovejoy et al., 2014; Carlile et al., 2014; Liu et al., 2015), and 5-methylcytosine (m5C) (Squires et al., 2012; Hussain et al., 2013; Huang et al., 2019). These methods, while useful, require specific antibodies or chemical reagents as well as experimental expertise that is beyond the reach of most computational labs. The recent development of direct RNA sequencing technology by Oxford Nanopore (Garalde et al., 2018) allows native RNA molecules to be sequenced directly. A motor protein controls the translocation of each RNA molecule through a nanopore, generating an electrical current, called a squiggle, that reflects the identity of the nucleotides passing through the pore (Figure 1a). The electrical current is deciphered into a sequence of the four RNA nucleotides (G, A, C, U) through a process called basecalling, which involves training either Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN) (Boža et al., 2017; Stoiber & Brown, 2017; Teng et al., 2018; Zeng et al., 2020) with the Connectionist Temporal Classification (CTC) approach (Graves et al., 2006). The presence of a modified nucleotide often results in a shift in the electrical current, which can be exploited for RNA modification detection. Nevertheless, modified nucleotides are rare, so only a short portion of a long RNA squiggle is relevant for modification detection.
As a result, existing detection methods often use segmentation algorithms (Loman et al., 2015; Stoiber et al.) during preprocessing in order to extract the signals corresponding to candidate modified positions (Stoiber et al.; Leger et al., 2019; Lorenz et al., 2020; Ueda; Pratanwanich et al., 2021; Gao et al., 2021; Begik et al., 2021; Parker et al., 2021; Hendra et al., 2021; Stephenson et al., 2022; Sethi et al., 2022). However, modifications such as m6A mostly occur within 18 of the 1024 possible 5-mer motifs (Meyer et al., 2012; Dominissini et al., 2012; Schwartz et al., 2014b), while other modifications such as m5C or pseudouridine only occur within segments containing C or U nucleotides, respectively. Since segmentation algorithms typically segment the entire transcriptome, the modification detection pipeline often requires a large amount of storage to hold the segmentation results and suffers from slow running time due to the many preprocessing steps required to extract relevant features from the potentially modified positions. In this work we attempt to address these shortcomings by combining several machine learning techniques that help to streamline the RNA modification detection process. Firstly, we make use of the deep features learnt by the CTC basecaller, with the aim of integrating modification detection into the basecalling process in the future. Secondly, we implement an attention layer between sequence embeddings of the candidate modified positions and the deep CTC features to perform implicit segmentation and feature extraction of the target positions. Finally, to address the issue of noisy modification labels, we implement an end-to-end Attention-based Multiple Instance Learning approach on top of the extracted attention features so as to perform robust classification of modified positions.
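The second and third steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names `attend` and `mil_pool`, the toy dimensions, the random inputs, and the simplified tanh scoring in the pooling step are all assumptions made for illustration. The sketch shows a site embedding acting as a query over per-frame basecaller features (the implicit segmentation), followed by attention-weighted pooling of the resulting per-read vectors into one site-level probability.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, features):
    """Scaled dot-product attention of one site query over T feature frames.

    query:    d floats, a hypothetical embedding of the candidate 5-mer
    features: T lists of d floats, hypothetical basecaller features per frame
    Returns (context vector of length d, attention weights of length T).
    """
    d = len(query)
    scores = [dot(query, f) / math.sqrt(d) for f in features]
    weights = softmax(scores)  # high weight = frame likely belongs to the site
    context = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
    return context, weights


def mil_pool(instances, w_att, w_cls, bias=0.0):
    """Attention-based MIL pooling (simplified): one probability per bag of reads.

    instances: per-read context vectors for the same candidate site
    w_att:     assumed scoring vector for the read-level attention
    w_cls:     assumed weights of a linear classifier on the pooled vector
    """
    scores = [dot(w_att, [math.tanh(x) for x in h]) for h in instances]
    alphas = softmax(scores)  # contribution of each read to the site-level call
    d = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(alphas, instances)) for j in range(d)]
    prob = 1.0 / (1.0 + math.exp(-(dot(w_cls, bag) + bias)))
    return prob, alphas


# Toy usage: 3 reads covering one site, 50 frames each, 4-dimensional features.
random.seed(0)
d, T, n_reads = 4, 50, 3
query = [random.gauss(0, 1) for _ in range(d)]
reads = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
         for _ in range(n_reads)]
contexts = [attend(query, feats)[0] for feats in reads]
w_att = [random.gauss(0, 1) for _ in range(d)]
w_cls = [random.gauss(0, 1) for _ in range(d)]
site_prob, read_weights = mil_pool(contexts, w_att, w_cls)
print(site_prob, read_weights)
```

In this sketch no explicit signal-to-base alignment is ever stored: the attention weights play the role of a soft segmentation, and the read-level weights let unmodified or noisy reads contribute less to the site-level prediction.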
We validate our approach on an m6A detection task and demonstrate that it is significantly faster than existing m6A detection methods while achieving performance comparable to the current state-of-the-art algorithm. Our work contributes to the field of RNA modification by providing a more scalable solution to RNA modification detection, and we hope to drive wider adoption of machine learning techniques for problems in biology, especially in long-read RNA sequencing.
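The figure of 18 out of 1024 5-mers quoted above can be checked with a short enumeration. The sketch assumes that the 18 motifs are the DRACH consensus commonly reported for m6A (D = A/G/U, R = A/G, H = A/C/U); the text does not name the motif explicitly, so this mapping and the helper name `expand` are assumptions.

```python
from itertools import product

# IUPAC-style codes for the assumed DRACH consensus (RNA alphabet).
IUPAC = {"D": "AGU", "R": "AG", "A": "A", "C": "C", "H": "ACU"}


def expand(motif):
    """List every concrete k-mer matched by a degenerate motif string."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in motif))]


drach = expand("DRACH")                                      # 3 * 2 * 1 * 1 * 3 motifs
all_5mers = ["".join(p) for p in product("ACGU", repeat=5)]  # 4 ** 5 motifs
print(len(drach), len(all_5mers))  # 18 1024
```

Under this assumption fewer than 2% of all 5-mers are candidate m6A sites, which is why segmenting every position of every read produces mostly unused output.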

2. METHOD

The direct RNA sequencing workflow involves basecalling of RNA squiggles, followed by alignment of the basecalled results to the transcriptome (Figure 1). For modification detection, most detection algorithms require an additional segmentation step. This step is often necessary because the RNA squiggle is noisy and modifications occur at only a handful of positions, which means that most of the signal is not useful for detecting RNA modifications. Nevertheless, segmentation algorithms produce many segmented signal regions that are never used, and most detection algorithms require



Figure 1: (a) An RNA molecule being translocated through the nanopore. Image adapted from Wan et al. (2022). (b) Electrical current, or nanopore squiggle, generated as the RNA nucleotides pass through the nanopore. (c) The squiggle is deciphered into a series of nucleotides through basecalling. RNA modifications such as m6A can only occur in the presence of an AC motif, so signals matching all other motifs might not be useful for detecting m6A modifications. (d) The basecalled sequence is mapped to the reference transcripts, correcting some errors made during basecalling. (e) Most modification detection methods perform a segmentation step in order to locate the squiggle corresponding to the candidate AC motif before further preprocessing and modification prediction. (f) Our method skips the segmentation step and directly outputs the modification probability for the candidate positions.

