MULTIPLE SEQUENCE ALIGNMENT AS A SEQUENCE-TO-SEQUENCE LEARNING PROBLEM

Abstract

The sequence alignment problem is one of the most fundamental problems in bioinformatics, and a plethora of methods have been devised to tackle it. Here we introduce BetaAlign, a methodology for aligning sequences using a natural language processing (NLP) approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions of samples generated from a different evolutionary model. Our approach leads to alignment accuracy that is similar to, and often better than, that of commonly used methods such as MAFFT, DIALIGN, ClustalW, T-Coffee, PRANK, and MUSCLE.

1. INTRODUCTION

A multiple sequence alignment (MSA) provides a record of homology at single-position resolution within a set of homologous sequences. To infer the MSA from input sequences, one has to consider different evolutionary events, such as substitutions and indels (i.e., insertions and deletions). MSAs can be computed for DNA, RNA, or amino acid sequences. MSA inference is considered one of the most common problems in biology (Van Noorden et al., 2014). Moreover, MSAs are a required input for various widely used bioinformatics methods such as domain analysis, phylogenetic reconstruction, motif finding, and ancestral sequence reconstruction (Kemena & Notredame, 2009; Avram et al., 2019). These methods assume that the MSA is correct, and their performance might degrade when inaccurate MSAs are used as input (Ogden & Rosenberg, 2006; Privman et al., 2012).

MSA algorithms typically assume a fixed cost matrix as input, i.e., the penalty for aligning nonidentical characters and the reward for aligning identical characters. They also assume a penalty for the introduction of gaps. These costs dictate the score of each possible alignment, and the algorithm aims to output the alignment with the highest score. Previous studies demonstrated that fitting the cost matrix configuration to the data increases MSA inference accuracy (Rubio-Largo et al., 2018; Llinares-López et al., 2021). Thus, MSA algorithms often allow users to tune the parameters that control the MSA computation. In practice, however, these parameters are seldom altered, and only a few default configurations are used.

Alignment algorithms are often benchmarked against empirical alignment regions that are thought to be reliable. However, such regions within alignments do not reflect the universe of alignment problems (Thompson et al., 1994). Of note, these regions were often computed manually, so there is no guarantee that they represent a reliable "gold standard" (Morrison, 2009). When alignment programs are tested on simulated complex alignments, the results differ from those obtained on the empirical benchmarks (Chang et al., 2014).

*Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
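
To make the cost-based formulation described in this introduction concrete, the following is a minimal sketch of how a match reward, a mismatch penalty, and a gap penalty determine the score of an alignment, and of how dynamic programming finds the highest achievable score. It is restricted to the pairwise case and uses a linear gap penalty for brevity. The function names and the default values (match = 1, mismatch = -1, gap = -2) are illustrative assumptions and are not taken from any of the cited tools.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score a given pairwise alignment (two gapped rows of equal length).

    The cost values are illustrative assumptions, not tool defaults.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # an all-gap column is ignored in this sketch
        if x == "-" or y == "-":
            score += gap        # penalty for introducing a gap
        elif x == y:
            score += match      # reward for aligning identical characters
        else:
            score += mismatch   # penalty for aligning nonidentical characters
    return score


def best_global_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Highest score over all global pairwise alignments (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning a prefix of seq_a with seq_b[:j].
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        curr = [i * gap]
        for j in range(1, len(seq_b) + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align the two characters
                            prev[j] + gap,       # gap in seq_b
                            curr[j - 1] + gap))  # gap in seq_a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_score("AC-GT", "ACGGT"))
    print(best_global_score("ACGT", "ACGGT"))
    # Changing the costs changes which alignment is optimal, and hence the score.
    print(best_global_score("AT", "TA"))
    print(best_global_score("AT", "TA", mismatch=-3, gap=-1))
```

Under the default costs in this sketch, the best alignment of "AT" and "TA" is the ungapped one (two mismatches), whereas with a harsher mismatch penalty and a milder gap penalty a gapped alignment becomes optimal. This illustrates why the choice of cost configuration affects the inferred alignment.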

