MULTIPLE SEQUENCE ALIGNMENT AS A SEQUENCE-TO-SEQUENCE LEARNING PROBLEM

Abstract

The sequence alignment problem is one of the most fundamental problems in bioinformatics, and a plethora of methods has been devised to tackle it. Here we introduce BetaAlign, a methodology for aligning sequences using an NLP approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions of samples generated from a different evolutionary model. Our approach leads to alignment accuracy that is comparable to, and often better than, that of commonly used methods, such as MAFFT, DIALIGN, ClustalW, T-Coffee, PRANK, and MUSCLE.

1. INTRODUCTION

A multiple sequence alignment (MSA) provides a record of homology at single-position resolution within a set of homologous sequences. To infer the MSA from input sequences, one has to consider different evolutionary events, such as substitutions and indels (i.e., insertions and deletions). MSAs can be computed for DNA, RNA, or amino acid sequences. MSA inference is considered one of the most common problems in biology (Van Noorden et al., 2014). Moreover, MSAs are required input for various widely used bioinformatics methods such as domain analysis, phylogenetic reconstruction, motif finding, and ancestral sequence reconstruction (Kemena & Notredame, 2009; Avram et al., 2019). These methods assume the correctness of the MSA, and their performance might degrade when inaccurate MSAs are used as input (Ogden & Rosenberg, 2006; Privman et al., 2012).

MSA algorithms typically assume a fixed cost matrix as input, i.e., the penalty for aligning non-identical characters and the reward for aligning identical characters. They also assume a penalty for the introduction of gaps. These costs dictate the score of each possible alignment, and the algorithm aims to output the alignment with the highest score. Previous studies demonstrated that fitting the cost matrix configuration to the data increases MSA inference accuracy (Rubio-Largo et al., 2018; Llinares-López et al., 2021). Thus, MSA algorithms often allow users to tune parameters that control the MSA computation. However, in practice, these parameters are seldom altered, and only a few default configurations are used.

Alignment algorithms are often benchmarked against empirical alignment regions, which are thought to be reliable. However, such regions within alignments do not reflect the universe of alignment problems (Thompson et al., 1994). Of note, these regions were often manually computed, so there is no guarantee that they represent a reliable "gold standard" (Morrison, 2009).
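To make the cost-based formulation above concrete, the following is a minimal sketch of classical score-based pairwise alignment (Needleman-Wunsch) under a fixed cost scheme. The match/mismatch/gap values are hypothetical defaults chosen for illustration, not the configuration of any particular aligner discussed in this paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment under fixed substitution and gap costs."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))
```

Changing the cost parameters changes which alignment scores highest, which is exactly why fitting these costs to the data affects inference accuracy.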
When testing alignment programs with simulated complex alignments, the results differ from the benchmark results (Chang et al., 2014).

In the last decade, deep-learning algorithms have revolutionized various fields (LeCun et al., 2015), including computer vision (Voulodimos et al., 2018), natural language processing (NLP) (Young et al., 2018), sequence correction (Baid et al., 2021), and medical diagnosis (Rakocz et al., 2021; Hill et al., 2021). Neural-network solutions have often resulted in a substantial increase in prediction accuracy compared to traditional algorithms. Deep learning has also changed molecular biology and evolutionary research, e.g., by allowing accurate predictions of three-dimensional protein structures using AlphaFold (Jumper et al., 2021).

We propose BetaAlign, a deep-learning method that is trained on known alignments and is able to accurately align novel sets of sequences. The method is based on the "transformer" architecture (Vaswani et al., 2017), a deep-learning architecture designed for sequence-to-sequence tasks, originally developed for machine translation. BetaAlign was trained on millions of alignments drawn from different evolutionary models. Our analyses demonstrate that BetaAlign has comparable, and in some cases superior, accuracy compared to the most popular MSA algorithms: T-Coffee (Notredame et al., 2000), ClustalW (Larkin et al., 2007), DIALIGN (Morgenstern, 2004), MUSCLE (Edgar, 2004), MAFFT (Katoh & Standley, 2013), and PRANK (Löytynoja & Goldman, 2008).

BetaAlign converts the alignment problem into a sequence-to-sequence learning problem, a well-studied problem within the NLP field (Hirschberg & Manning, 2015). We first present the NLP-based approach for pairwise alignment. We represent both the unaligned sequences and the resulting alignment as sentences in two different "languages".

*Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
The language of the input sequences (the source language) is termed Concat because, in this representation, the pair of input sequences is concatenated, with the pipe character ("|") marking the border between the sequences. Each character in the resulting string is considered a different token. For example, when aligning the sequence "AAG" against the sequence "ACGG", the pair is represented in the Concat language by the sentence "A A G | A C G G" (Fig. 1). This sentence contains eight tokens. Of note, in the Concat language there are only five possible unique tokens. The dictionary of a language is the entire set of tokens used in that language; thus, in the Concat language, the dictionary is the set {"A", "C", "G", "T", "|"}.

The target language, i.e., the language of the output alignment, is termed here Spaces. In this language, the dictionary is the set {"A", "C", "G", "T", "-"}. In this representation, the tokens of the two aligned sequences are interleaved: the first two tokens of the Spaces output sentence are the first character of the first aligned sequence and the first character of the second aligned sequence, respectively; the third and fourth tokens correspond to the second column of the pairwise alignment, and so on. In the above example, if in the output alignment "AA-G" is aligned to "ACGG", then the Spaces representation is the sentence "A A A C - G G G" (Fig. 1). We compared the performance of different source and target representations (see Section A.2.2).

With these representations, the input sequences are a sentence in one language, the output alignment is a sentence in another language, and a perfect alignment algorithm is requested to accurately transduce one language to the other. The details regarding the training and execution of the NLP transformers are provided in Section 2 and Section A.
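The encoding and decoding steps described above can be sketched as follows. The function names are our own illustrative choices, not part of BetaAlign's code; the sketch assumes pairwise alignment with the interleaved Spaces convention.

```python
def encode_concat(seq1, seq2):
    """Encode two unaligned sequences as a Concat-language sentence:
    one token per character, with "|" marking the sequence border."""
    return " ".join(list(seq1) + ["|"] + list(seq2))

def decode_spaces(sentence):
    """Decode a Spaces-language sentence back into two aligned rows.
    Even-indexed tokens belong to the first row, odd-indexed tokens to
    the second, i.e., consecutive token pairs form alignment columns."""
    tokens = sentence.split()
    row1 = "".join(tokens[0::2])
    row2 = "".join(tokens[1::2])
    return row1, row2
```

For the running example, `encode_concat("AAG", "ACGG")` yields the source sentence "A A G | A C G G", and decoding the target sentence "A A A C - G G G" recovers the aligned rows "AA-G" and "ACGG".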
Of note, while developing BetaAlign we encountered the occasional generation of invalid alignments. We addressed this challenge by introducing an ensemble of transformers (see Section A.2.3).
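A generated alignment can be checked for validity with a simple criterion: removing the gap characters from each aligned row must recover the corresponding unaligned input sequence, and all rows must have equal length. The following is our own minimal sketch of such a check, not BetaAlign's exact implementation.

```python
def is_valid_alignment(unaligned, aligned_rows):
    """Return True iff the aligned rows form a valid alignment of the
    unaligned input sequences."""
    if len(unaligned) != len(aligned_rows):
        return False
    # All rows of an alignment must have the same length (same columns).
    if len({len(row) for row in aligned_rows}) != 1:
        return False
    # Each row with gaps removed must equal its input sequence.
    return all(row.replace("-", "") == seq
               for seq, row in zip(unaligned, aligned_rows))
```

A transformer output failing this check can be discarded, and the prediction of another ensemble member used instead.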



Figure 1: An illustration of aligning sequences with sequence-to-sequence learning. (a) Consider two input sequences "AAG" and "ACGG". (b) The result of encoding the unaligned sequences into the source language (Concat representation). (c) The sentence from the source language is translated to the target language via a transformer model. (d) The translated sentence in the target language (Spaces representation). (e) The resulting alignment, decoded from the translated sentence, in which "AA-G" is aligned to "ACGG". The transformer architecture illustration is adapted from Vaswani et al. (2017).

