GENE FINDING REVISITED: IMPROVED ROBUSTNESS THROUGH STRUCTURED DECODING FROM LEARNED EMBEDDINGS

Abstract

Gene finding is the task of identifying the locations of coding sequences within the vast amount of genetic sequence contained in a genome. With an ever-increasing quantity of raw genome sequences, gene finding is an important avenue towards understanding the genetic information of (novel) organisms, as well as learning shared patterns across evolutionarily diverse species. The current state of the art consists of graphical models, usually trained per organism and requiring manually curated data sets. However, these models lack the flexibility to incorporate the representation learning techniques that have in recent years been transformative in the analysis of protein sequences, and which could potentially help gene finders exploit the growing number of sequenced genomes to improve performance across multiple organisms. Here, we propose a novel approach that combines learned embeddings of raw genetic sequences with exact decoding using a latent conditional random field. We show that the model matches the performance of the current state of the art while increasing training robustness and removing the need for manually fitted length distributions. As language models for DNA improve, this paves the way for more performant cross-organism gene finders.
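To make the decoding component concrete, the following is a minimal sketch, not the paper's implementation: exact maximum a posteriori decoding in a linear-chain CRF reduces to the Viterbi algorithm over per-position label scores (here, hypothetically derived from embeddings of the raw nucleotide sequence). The three-label scheme and dense transition matrix are illustrative assumptions; the latent-state structure of the actual model is not reproduced.

```python
# Minimal sketch (assumed setup, not the paper's code): exact Viterbi
# decoding of a linear-chain CRF over embedding-derived position scores.
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Exact MAP decoding for a linear-chain CRF.

    emissions:   (T, K) per-position label scores, e.g. a linear head
                 applied to learned embeddings of the DNA sequence.
    transitions: (K, K) score of moving from label i to label j.
    Returns the highest-scoring label sequence of length T.
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j] = best path ending in label i, then step i -> j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Illustrative label set (an assumption): 0=intergenic, 1=exon, 2=intron.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(12, 3))   # stand-in for embedding-derived scores
transitions = np.full((3, 3), -1.0)
np.fill_diagonal(transitions, 0.0)     # favour staying in the same state
print(viterbi_decode(emissions, transitions))
```

Because the maximisation is exact, structural constraints on gene anatomy can be enforced through the transition scores rather than through heuristic post-processing, which is one way such a decoder can improve robustness over unconstrained per-position prediction.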

1. INTRODUCTION

Genes are stretches of deoxyribonucleic acid (DNA) in the genome that encode the functional and structural products of the cell. The central dogma of molecular biology states that these segments are transcribed into ribonucleic acid (RNA) and, in many cases, translated into the amino acid sequences of proteins. In recent years, the machine learning community has dedicated considerable attention to studying proteins and solving various protein-related tasks with the aid of deep learning, resulting in impressive advances within the field (Detlefsen et al., 2022; Jumper et al., 2021; Rao et al., 2020; Shin et al., 2021).

Less attention has been paid to the DNA sequences themselves, despite the fact that finding genes in a genome remains an important open problem. Owing to technological advances, the rate at which genomes are sequenced is rising much more rapidly than the rate at which we can reliably annotate genes experimentally, and without proper gene annotations we lack information about the proteins encoded in these sequences. In particular, for taxonomies that are sparsely characterised or highly diverse, such as fungi, this hinders us from extracting essential information from newly sequenced genomes.

The wealth of available genomic data suggests that this is an area ripe for high-capacity deep learning approaches that automatically detect the most salient features in the data. This potential has in recent years been clearly demonstrated in the realm of proteins, where deep learning has proven extremely effective in both the supervised setting (Alley et al., 2019; Hsu et al., 2022; Jumper et al., 2021) and the unsupervised setting (Rao et al., 2021; 2020; Vig et al., 2021). In particular, embeddings obtained from transformer-based protein language models have pushed the performance boundaries in many downstream sequence-based prediction tasks. The advantages of such models are two-fold: 1) they enable pre-training in cases where unlabelled data far outweighs labelled data, and 2) they have demonstrated the ability to learn across diverse proteins. We are currently witnessing an emerging interest in language models for DNA as well, but progress in this area has proven more difficult than for its protein counterpart. In a completely unsupervised setting, the amount of DNA data is orders of magnitude larger than that of proteins, and the signals are

