ACCURATE AND FAST DETECTION OF COPY NUMBER VARIATIONS FROM SHORT-READ WHOLE-GENOME SEQUENCING WITH DEEP CONVOLUTIONAL NEURAL NETWORK

Abstract

A copy number variant (CNV) is a type of genetic mutation where a stretch of DNA is lost or duplicated once or multiple times. CNVs play important roles in the development of diseases and complex traits. CNV detection with short-read DNA sequencing technology is challenging because CNVs significantly vary in size and are similar to DNA sequencing artifacts. Many methods have been developed but still yield unsatisfactory results with high computational costs. Here, we propose CNV-Net, a novel approach for CNV detection using a six-layer convolutional neural network. We encode DNA sequencing information into RGB images and train the convolutional neural network with these images. The fitted convolutional neural network can then be used to predict CNVs from DNA sequencing data. We benchmark CNV-Net with two high-quality whole-genome sequencing datasets available from the Genome in a Bottle Consortium, considered as gold standard benchmarking datasets for CNV detection. We demonstrate that CNV-Net is more accurate and efficient in CNV detection than current tools.

1. INTRODUCTION

A copy number variant (CNV) is a genetic mutation where a stretch of DNA is completely lost or repeated more than once compared to a reference genome. CNV sizes range from 50 bases to 3 million bases or more, and there are two major types: duplication, if a DNA sequence is repeated, and deletion, if a DNA sequence is missing. CNVs are spread across the genome and account for 4.8% to 9.5% of the human genome (Zarrei et al., 2015). They are known to influence many complex diseases, including autism (Sebat et al., 2007), bipolar disorder (Green et al., 2016), schizophrenia (Stefansson et al., 2008), and Alzheimer's disease (Cuccaro et al., 2017), as well as gene expression (Chiang et al., 2017). Because a CNV may overlap a large portion of a genome, such as one gene or even multiple genes, its effect on disease may be substantially larger than that of a single nucleotide variant; hence CNVs often play important roles in the genetic mechanisms of diseases and complex traits. With the advent of next-generation DNA sequencing technologies over the past decade, the resolution and scale of CNV detection have greatly improved as large-scale sequencing studies have become feasible. However, CNVs are still challenging to detect from short-read next-generation DNA sequencing data because their sizes vary widely and they resemble common DNA sequencing artifacts. Many computational methods have been developed for the discovery of CNVs from short-read DNA sequencing data, but their performance is often unsatisfactory due to low accuracy and high computational cost (Kosugi et al., 2019). The main reason is that previous methods (Rausch et al., 2012; Abyzov et al., 2011) rely mainly on statistical analysis of the signals from read alignments (a process that maps reads from DNA sequencing data to the reference genome), an approach that often fails to use all useful features of DNA sequencing data and requires significant computational resources.
Therefore, there is a need for a novel computational tool that detects CNVs with higher accuracy and efficiency. Here, we present CNV-Net, a new approach to identify CNVs from DNA sequencing data using a six-layer deep convolutional neural network (CNN). CNV-Net first encodes the reads and their related information from DNA sequencing data, such as the type of base, base coverage (the number of reads covering a position), and base quality of each position, into the RGB channels of images. It then uses a deep CNN to classify each image as a deletion breakpoint, a duplication breakpoint, or a false positive (not a CNV). Note that a CNV is defined by two breakpoints, that is, its start and end positions. Although it has been shown that CNNs such as DeepVariant (Poplin et al., 2018) and Clairvoyante (Luo et al., 2019) can accurately detect single nucleotide variants from similarly encoded images or tensors, few CNNs have been designed for the identification of CNVs. The key advantage of CNV-Net over other methods is the application of deep learning, which allows it to learn more complex features from DNA sequencing data. We train the network with the Genome in a Bottle Consortium NA12878 dataset (Pendleton et al., 2015) and demonstrate CNV-Net performance on a high-quality benchmarking dataset acquired from a well-studied cell line, HG002 (Zook et al., 2020). We also compare its performance to that of previous methods.

2. METHODS

CNV-Net first encodes sequencing information into pileup images and then classifies each image as a deletion, duplication, or false positive (not a CNV). In this section, we describe how we generate the pileup images that serve as the input to CNV-Net, as well as the design of the CNN.

2.1. PILEUP IMAGE GENERATION

We convert the information of mapped reads from DNA sequencing data (provided in BAM file format) and reference genome information into a 201×55 pixel image representation of 201-base-wide regions. Training data pileup images contain a CNV breakpoint (start or end position) centered in the image, capturing the 100 positions to the left and right of each CNV breakpoint. Using the RGB channels of an image, this snapshot captures features of the reads at specific locations in the reference genome.

Each 201×55 pixel pileup image thus captures the coverage and reads specific to a 201-base region of the reference genome, such that the sequencing data may be re-created from a series of consecutive pileup images. If more than 45 reads map to a specific region, the excess reads are discarded; we instead compensate by capturing this information through the rows that encode coverage. If no reads map to the region, the image is left blank below the first five rows, which encode the reference genome.
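The windowing and read-capping rules of this section can be illustrated with a minimal Python sketch. It is not the CNV-Net implementation: the `Read` record, the function names, the 0-based half-open coordinates, and the choice to keep the first 45 overlapping reads in input order are all assumptions made for illustration. Only the 100-position flank, the 201-column width, and the 45-read limit come from the text. Windows that would extend past the start of a chromosome are not handled.

```python
from dataclasses import dataclass
from typing import List, Tuple

WINDOW_FLANK = 100                  # positions on each side of a breakpoint (from the text)
IMAGE_WIDTH = 2 * WINDOW_FLANK + 1  # 201 columns, breakpoint in the center
MAX_READ_ROWS = 45                  # reads beyond this are not drawn (from the text)


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record: 0-based start and aligned length."""
    start: int
    length: int

    @property
    def end(self) -> int:
        """Exclusive end coordinate of the aligned read."""
        return self.start + self.length


def breakpoint_window(breakpoint: int) -> Tuple[int, int]:
    """Half-open [start, end) window of 201 positions centered on a breakpoint."""
    return breakpoint - WINDOW_FLANK, breakpoint + WINDOW_FLANK + 1


def window_coverage(reads: List[Read], window: Tuple[int, int]) -> List[int]:
    """Per-position read depth inside the window, counting every read.

    All reads contribute here, including those that will not get a row of
    their own, which is how coverage can compensate for discarded reads.
    """
    start, end = window
    coverage = [0] * (end - start)
    for read in reads:
        lo = max(read.start, start)
        hi = min(read.end, end)
        for pos in range(lo, hi):  # empty when the read misses the window
            coverage[pos - start] += 1
    return coverage


def reads_for_rows(reads: List[Read], window: Tuple[int, int]) -> List[Read]:
    """Reads that overlap the window, capped at MAX_READ_ROWS.

    Keeping the first 45 in input order is an assumption; the text does not
    say which reads are retained.
    """
    start, end = window
    overlapping = [r for r in reads if r.start < end and r.end > start]
    return overlapping[:MAX_READ_ROWS]
```

For a breakpoint at position 1000, `breakpoint_window(1000)` covers positions 900 through 1100 inclusive. If 50 reads overlap that window, `reads_for_rows` returns 45 of them while `window_coverage` still reports a depth of 50 at the positions they cover.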

Figure 1: 201×55 pixel pileup image. Feature rows are labeled by brackets and RGB channels are labeled by curly braces. Reference rows are encoded in the R channel by base (A: 250, G: 180, T: 100, C: 30, N: 0). Coverage rows are encoded in the RG channels by the number of mapped reads covering each position. Reads are encoded in the RGB channels by base type, base quality, and strand directionality (positive: 70, negative: 240).
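As a rough illustration of the reference-row encoding given in the caption, the following Python sketch fills the R channel of the first five rows of a blank 201×55 image. The nested-list image layout, the function names, and the treatment of unrecognized bases as N are assumptions made for illustration. The caption gives the strand values but not the channel that stores them, so the sketch only records those values as a lookup table and does not draw read rows.

```python
from typing import List

IMAGE_WIDTH = 201    # columns, one per reference position
IMAGE_HEIGHT = 55    # rows
REFERENCE_ROWS = 5   # the first five rows hold the reference genome

# R-channel intensity per reference base, as listed in the figure caption.
REFERENCE_RED = {"A": 250, "G": 180, "T": 100, "C": 30, "N": 0}

# Strand directionality values from the caption; the target channel is not
# stated there, so no channel is assigned in this sketch.
STRAND_VALUE = {"+": 70, "-": 240}

Image = List[List[List[int]]]  # image[row][col] == [R, G, B]


def blank_image() -> Image:
    """All-zero image with IMAGE_HEIGHT rows and IMAGE_WIDTH columns."""
    return [[[0, 0, 0] for _ in range(IMAGE_WIDTH)] for _ in range(IMAGE_HEIGHT)]


def encode_reference(image: Image, reference: str) -> Image:
    """Write the reference sequence into the R channel of the reference rows.

    The image is modified in place and also returned for convenience.
    Bases outside A/C/G/T/N are encoded as 0, like N (an assumption).
    """
    if len(reference) != IMAGE_WIDTH:
        raise ValueError(f"expected {IMAGE_WIDTH} bases, got {len(reference)}")
    for col, base in enumerate(reference.upper()):
        red = REFERENCE_RED.get(base, 0)
        for row in range(REFERENCE_ROWS):
            image[row][col][0] = red
    return image
```

With no reads drawn, the result matches the blank-image case described in this section: only the first five rows are non-zero.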

