GENE FINDING REVISITED: IMPROVED ROBUSTNESS THROUGH STRUCTURED DECODING FROM LEARNING EMBEDDINGS

Abstract

Gene finding is the task of identifying the locations of coding sequences within the vast amount of genetic code contained in the genome. With an ever-increasing quantity of raw genome sequences, gene finding is an important avenue towards understanding the genetic information of (novel) organisms, as well as learning shared patterns across evolutionarily diverse species. The current state of the art consists of graphical models, usually trained per organism and requiring manually curated data sets. However, these models lack the flexibility to incorporate deep representation learning techniques that have in recent years been transformative in the analysis of protein sequences, and which could potentially help gene finders exploit the growing number of sequenced genomes to expand performance across multiple organisms. Here, we propose a novel approach, combining learned embeddings of raw genetic sequences with exact decoding using a latent conditional random field. We show that the model achieves performance matching the current state of the art, while increasing training robustness and removing the need for manually fitted length distributions. As language models for DNA improve, this paves the way for more performant cross-organism gene finders.

1. INTRODUCTION

Genes are patches of deoxyribonucleic acid (DNA) in our genome that encode functional and structural products of the cell. The central dogma of biology states that these segments are transcribed into ribonucleic acid (RNA) and in many cases translated into the amino acid sequences of proteins. In recent years, the machine learning community has dedicated considerable attention to studying proteins, and to solving various protein-related tasks, with the aid of deep learning. This focus has resulted in impressive advances within the field (Detlefsen et al., 2022; Jumper et al., 2021; Rao et al., 2020; Shin et al., 2021). Less attention has been paid to the DNA sequences themselves, despite the fact that finding genes in a genome remains an important open problem. Due to technological advances, the rate at which genomes are sequenced is rising much more rapidly than the rate at which we can reliably annotate genes experimentally, and without proper gene annotations, we lack information about the proteins encoded in these sequences. In particular, for taxonomies that are sparsely characterised or highly diverse, such as fungi, this hinders us from extracting essential information from newly sequenced genomes. The wealth of available genomic data suggests that this is an area ripe for high-capacity deep learning approaches that automatically detect the most salient features in the data. This potential has in recent years been clearly demonstrated in the realm of proteins, where deep learning has proven extremely effective in both the supervised setting (Alley et al., 2019; Hsu et al., 2022; Jumper et al., 2021) and the unsupervised setting (Rao et al., 2021; 2020; Vig et al., 2021). In particular, embeddings obtained from transformer-based protein language models have pushed the boundaries of performance in many downstream sequence-based prediction tasks.
The advantages of such models are two-fold: 1) they enable pre-training in cases where unlabelled data far outweighs labelled data, and 2) they have demonstrated the ability to learn across diverse proteins. We are currently witnessing an emerging interest in language models for DNA as well, but progress in this area has proven more difficult than for its protein counterpart. In a completely unsupervised setting, the amount of DNA data is orders of magnitude larger than that of proteins, and the signals are correspondingly sparse. For instance, eukaryotic genomes consist of millions to billions of DNA base pairs, but only a small percentage are genes and an even smaller percentage codes for protein (approx. 1% in the human genome). Genes also have an intricate structure, which demands a high degree of consistency between output labels predicted at different positions in the sequence. In particular, genes contain both coding segments (called CDS or exons) and intron segments. Only the coding segments are retained in the final protein, while the introns are removed. The process in which introns are removed is called splicing, which occurs after the gene is transcribed from DNA to RNA. After introns are spliced out, the RNA is translated to an amino acid sequence (the protein product). Each amino acid is encoded by a codon (a triplet of DNA nucleotides) in the RNA sequence. Due to this codon structure, the annotation of the coding sequences in the gene must be extremely accurate, as shifting the codon reading frame by just one nucleotide results in a nonsensical protein. Gene prediction thus consists of correctly annotating the boundaries of a gene as well as the intron splice sites (donor/acceptor sites), a task challenged both by the imbalanced data and by the extreme accuracy needed. The current state of the art in gene finding relies on Hidden Markov Models (HMMs) and exact decoding (e.g.
Viterbi) to ensure the required consistency among predictions at different output positions. To make these methods work well in practice, considerable effort has been put into hand-coded length distributions inside the HMM transition matrix, and into careful curation of the training data to ensure that the length statistics are representative of the genome in question. The resulting HMMs have dominated the field for more than two decades. However, their performance still leaves much to be desired: they are generally difficult to train, and they have no mechanism for incorporating learned embeddings or context-dependent learned length distributions. These models can be improved by combining them with external hints and constructing pipelines (Hoff et al., 2016), but they are not compatible with the deep learning advances that have revolutionised adjacent fields. The goal of this paper is to develop a new approach that is compatible with contemporary deep learning practices and can be trained without manual feature engineering and careful data curation, while maintaining the capability for exact decoding. Here we present an approach to gene prediction, which we term GeneDecoder, that is able to incorporate both prior knowledge of gene structure, in the form of latent graphs in a conditional random field, and embeddings learned directly from the DNA sequence. This approach proves easy to train naively while still achieving high performance across a range of diverse genomes. We highlight that the resulting model is very flexible and open to improvement, either by including external hints or by improving the current DNA sequence embeddings. We benchmark against three other supervised algorithms (Augustus, Snap, GlimmerHMM) and find that the performance of our model competes with the state of the art (Scalzitti et al., 2020) without a strong effort put into model selection.
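The reading-frame sensitivity described above can be made concrete with a small sketch. This is our illustration, not code from any gene finder; the codon table below is a tiny subset of the standard genetic code, covering only the codons in the example sequence:

```python
# Illustration only: a small subset of the standard codon table,
# covering just the codons used in the example sequence below.
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "AAA": "K", "GGC": "G", "TAA": "*",
    "TGG": "W", "CTA": "L", "AAG": "K",
}

def translate(seq, frame=0):
    """Translate a DNA sequence codon-by-codon, starting at offset `frame`."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

cds = "ATGGCTAAAGGCTAA"
print(translate(cds, frame=0))  # MAKG*  -- intended protein ('*' = stop codon)
print(translate(cds, frame=1))  # WLKA   -- shifted by one nucleotide: unrelated protein
```

A single-nucleotide error in a predicted CDS boundary shifts every downstream codon, which is why per-position accuracy alone is not enough and frame consistency must be enforced.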
However, as pre-trained DNA models start to emerge and improve, we expect that the full potential of this approach will be realised.
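The exact decoding that both the HMM baselines and our CRF rely on can be sketched in a few lines. The 3-state label set (e.g. intergenic, CDS, intron) and the scores are hypothetical and purely for illustration; GeneDecoder's actual latent-CRF graph is richer than this:

```python
import numpy as np

# Minimal Viterbi decoder: `emissions` holds per-position log-scores for each
# label, `transitions` holds pairwise label-compatibility log-scores.
def viterbi(emissions, transitions):
    """emissions: (T, S) array; transitions: (S, S) array, rows = previous
    state, columns = next state. Returns the highest-scoring label path."""
    T, S = emissions.shape
    dp = np.empty((T, S))            # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    dp[0] = emissions[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + transitions  # (prev, next)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + emissions[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because the maximisation is exact, a structural constraint encoded as a forbidden transition (log-score of negative infinity) can never be violated in the decoded path, regardless of how confident the per-position scores are.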

2. RELATED WORK

Current gene prediction algorithms are Hidden Markov Models (HMMs) or generalized HMMs. These include Augustus (Stanke & Waack, 2003), Snap (Korf, 2004), GlimmerHMM (Majoros et al., 2004) and Genemark.hmm (Borodovsky & Lomsadze, 2011). All these models are trained fully supervised and on a per-organism basis. Genemark also exists in a version that is similarly trained on one organism, but in an iterative self-supervised manner (Ter-Hovhannisyan et al., 2008). In practice, gene prediction is often done through meta-prediction pipelines such as Braker (Hoff et al., 2016), Maker2 (Holt & Yandell, 2011) and Snowyowl (Reid et al., 2014), which typically combine preexisting HMMs with external hints (e.g. protein or RNA alignments) and/or iterated training. Focusing on the individual gene predictors, Augustus is currently the tool of choice in the supervised setting, according to a recent benchmark study (Scalzitti et al., 2020). It is an HMM with explicit length distributions for intron and CDS states. The model also includes multiple types of intron and exon states, emitting either fixed-length sequences or sequences from a length distribution given by a specific choice of self-transitions between states. This intron model has been shown to be key to its performance; without it, Augustus was found to be incapable of modelling intron length distributions correctly (Stanke & Waack, 2003). Models like Snap and GlimmerHMM follow similar ideas but differ in their transition structure. In particular, GlimmerHMM includes a splice site model from Genesplicer (Pertea et al., 2001). These HMM-based gene predictors are known to be
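The self-transition construction mentioned above has a well-known consequence that can be verified in a few lines (our sketch, not code from any of the cited tools): a plain HMM state that repeats with probability p implicitly assigns segment lengths a geometric distribution, P(L = k) = (1 - p) * p^(k-1), with mean 1/(1 - p). Real intron and exon lengths are not geometric, which is why Augustus-style explicit length models matter:

```python
import random

def sample_state_duration(p, rng):
    """Length of one stay in an HMM state with self-transition probability p."""
    length = 1
    while rng.random() < p:  # keep taking the self-transition
        length += 1
    return length

def geometric_pmf(p, k):
    """Implied length distribution: P(L = k) = (1 - p) * p**(k - 1)."""
    return (1 - p) * p ** (k - 1)

rng = random.Random(0)
p = 0.9
lengths = [sample_state_duration(p, rng) for _ in range(20000)]
print(sum(lengths) / len(lengths))  # empirical mean, close to 1 / (1 - p) = 10
```

The only free knob in this construction is p, i.e. the mean; the shape of the length distribution is fixed, so matching an empirical, non-geometric length profile requires extra machinery (duration models or chained states).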

