ACCURATE AND FAST DETECTION OF COPY NUMBER VARIATIONS FROM SHORT-READ WHOLE-GENOME SEQUENCING WITH DEEP CONVOLUTIONAL NEURAL NETWORK

Abstract

A copy number variant (CNV) is a type of genetic mutation in which a stretch of DNA is lost or duplicated one or more times. CNVs play important roles in the development of diseases and complex traits. CNV detection with short-read DNA sequencing technology is challenging because CNVs vary widely in size and resemble DNA sequencing artifacts. Many methods have been developed, but they still yield unsatisfactory results at high computational cost. Here, we propose CNV-Net, a novel approach for CNV detection using a six-layer convolutional neural network. We encode DNA sequencing information into RGB images and train the convolutional neural network on these images. The fitted convolutional neural network can then be used to predict CNVs from DNA sequencing data. We benchmark CNV-Net on two high-quality whole-genome sequencing datasets available from the Genome in a Bottle Consortium, which are considered gold-standard benchmarking datasets for CNV detection. We demonstrate that CNV-Net is more accurate and efficient in CNV detection than current tools.

1. INTRODUCTION

A copy number variant (CNV) is a genetic mutation in which a stretch of DNA is completely lost or repeated more than once compared to a reference genome. CNVs range in size from 50 bases to 3 million bases or more and fall into two major types: duplications, in which a DNA sequence is repeated, and deletions, in which a DNA sequence is missing. CNVs are spread throughout the genome and account for 4.8% to 9.5% of the human genome (Zarrei et al., 2015). They are known to influence many complex diseases, including autism (Sebat et al., 2007), bipolar disorder (Green et al., 2016), schizophrenia (Stefansson et al., 2008), and Alzheimer's disease (Cuccaro et al., 2017), as well as gene expression (Chiang et al., 2017). Because a CNV may overlap a large portion of a genome, such as one gene or even multiple genes, its effect on disease may be substantially larger than that of a single nucleotide variant, and hence CNVs often play important roles in the genetic mechanisms of diseases and complex traits. With the advent of next-generation DNA sequencing technologies over the past decade, the resolution and scale of CNV detection have greatly improved as large-scale sequencing studies have become feasible. However, CNVs are still challenging to detect from short-read next-generation DNA sequencing data because CNV sizes vary widely and CNVs resemble common DNA sequencing artifacts. Many computational methods have been developed for the discovery of CNVs from short-read DNA sequencing data, but their performance is often unsatisfactory due to low accuracy and high computational cost (Kosugi et al., 2019). The main reason is that previous methods (Rausch et al., 2012; Abyzov et al., 2011) rely mainly on statistical analysis of signals from read alignments (read alignment being the process that maps reads from DNA sequencing data to the reference genome); such analysis often fails to employ all useful features of DNA sequencing data and requires significant computational resources.
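To make the idea of an alignment-derived signal concrete, the following minimal sketch labels genomic windows by their read depth relative to the median depth. It is only an illustration of a generic read-depth heuristic, not the algorithm of any cited tool and not part of CNV-Net; the function name `call_windows`, the thresholds `low` and `high`, and the depth values are all assumptions chosen for the example.

```python
from statistics import median


def call_windows(depths, low=0.75, high=1.25):
    """Label each window by its depth relative to the median depth.

    `low` and `high` are illustrative thresholds (assumptions),
    not values taken from any published tool.
    """
    baseline = median(depths)
    labels = []
    for depth in depths:
        ratio = depth / baseline
        if ratio < low:
            # Fewer reads than expected: consistent with a deletion.
            labels.append("deletion")
        elif ratio > high:
            # More reads than expected: consistent with a duplication.
            labels.append("duplication")
        else:
            labels.append("normal")
    return labels


# Hypothetical mean read depth in consecutive fixed-size windows.
depths = [30, 31, 29, 15, 14, 30, 46, 45, 30]
labels = call_windows(depths)
print(labels)
```

A threshold rule of this kind uses only one feature of the alignments (depth) and is sensitive to depth fluctuations caused by sequencing artifacts, which is one way to see why richer representations of the read data may help.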
Therefore, there is a need for a novel, sophisticated computational tool to improve CNV detection with higher accuracy and efficiency. Here, we present CNV-Net, a new approach to identify CNVs from DNA sequencing data using a six-layer deep convolutional neural network (CNN). CNV-Net first encodes the reads and their

