HUMAN PERCEPTION-BASED EVALUATION CRITERION FOR ULTRA-HIGH RESOLUTION CELL MEMBRANE SEGMENTATION

Abstract

Computer vision technology is widely used in biological and medical data analysis and understanding. However, two major bottlenecks still seriously hinder further research on cell membrane segmentation: the lack of sufficient high-quality data and the lack of suitable evaluation criteria. To address these two problems, this paper first introduces U-RISC, an Ultra-high Resolution Image Segmentation dataset for the Cell membrane, the largest annotated Electron Microscopy (EM) dataset for the cell membrane, with multiple iterative annotations and uncompressed high-resolution raw data. During the analysis of U-RISC, we found that the currently popular segmentation evaluation criteria are inconsistent with human perception. This phenomenon is confirmed by a subjective experiment involving twenty participants. To resolve this inconsistency, we propose a new evaluation criterion, called Perceptual Hausdorff Distance (PHD), to measure the quality of cell membrane segmentation results. We also provide a detailed performance comparison and discussion of classic segmentation methods, along with two rounds of manual annotation, under both existing evaluation criteria and PHD.

1. INTRODUCTION

Electron Microscopy (EM) is a powerful tool to explore ultra-fine structures in biological tissues and has been widely used in medical and biological research (Erlandson (2009); Curry et al. (2006); Harris et al. (2006)). In recent years, EM techniques have pioneered an emerging field called "Connectomics" (Lichtman et al. (2014)), which aims to scan and reconstruct whole-brain circuitry at the nanoscale. "Connectomics" has played a key role in several ambitious projects, including the BRAIN Initiative (Insel et al. (2013)) and MICrONS (Gleeson & Sawyer (2018)) in the U.S., Brain/MINDS in Japan (Dando (2020)), and the China Brain Project (Poo et al. (2016)). Because EM scans brain slices at the nanoscale, it produces massive images at ultra-high resolution and inevitably leads to an explosion of data. However, compared to the advances in EM itself, techniques for data analysis fall far behind. In particular, how to automatically extract information from massive raw data to reconstruct the circuitry map has increasingly become the bottleneck of EM applications. Although much progress has been made in cell membrane segmentation for EM data thanks to deep learning, one risk for these popular and classic methods is that they might be "saturated" on the current datasets, as their performance appears to be "exceedingly accurate" (Lee et al. (2017)). How well do these classic deep-learning-based segmentation methods work on new EM datasets with higher resolution and perhaps more challenges? Moreover, how robust are these methods when compared with human performance on such EM images?

One critical step in automatic EM data analysis is membrane segmentation. With the introduction of deep learning techniques, significant improvements have been achieved on several publicly available EM datasets, ISBI 2012 and SNEMI3D (ISBI 2012 (2012); ISBI 2013 (2013); Arganda-Carreras et al. (2015b); Lee et al. (2017)). One of the earliest works (Ciresan et al. (2012)) used a succession of max-pooling convolutional networks as a pixel classifier, estimating the probability that a pixel belongs to a membrane. Ronneberger et al. (2015) presented the U-Net structure with contracting paths, which captures multi-scale contextual information. Fully convolutional networks (FCNs), proposed by Long et al. (2015), led to a breakthrough in semantic segmentation. Follow-up works based on the U-Net and FCN structures (Xie & Tu (2015); Drozdzal et al. (2016); Hu et al. (2018); Zhou et al. (2018); Chaurasia & Culurciello (2017); Yu et al. (2017); Chen et al. (2019b)) have also achieved outstanding results with near-human performance.

To expand the research of membrane segmentation on more comprehensive EM data, we first established a dataset, "U-RISC", containing images at the original resolution (10000 × 10000 pixels, Fig. 1). To ensure the quality of annotation, it took over 10,000 labor hours to label and double-check the data. To the best of our knowledge, U-RISC is the largest uncompressed, annotated EM dataset to date. Next, we tested several classic deep-learning-based segmentation methods on U-RISC and compared the results to human performance. We found that the performance of these methods was much lower than that of the first-round annotation. To understand why human perception outperforms these popular segmentation methods, we examined their membrane segmentation results in detail. How to measure the similarity between two image segmentation results has been widely discussed (Yeghiazaryan & Voiculescu (2018); Niessen et al. (2000); Veltkamp & Hagedoorn (2000); Lee et al. (2017)). Yeghiazaryan & Voiculescu (2018) discussed the family of boundary overlap metrics for the evaluation of medical image segmentation, and Veltkamp & Hagedoorn (2000) formulated and summarized similarity measures under more general conditions. In challenges such as ISBI 2012 (Arganda-Carreras et al. (2015a)), multiple metrics, such as the Rand score, were also considered on both original and thinned images. However, we found a certain inconsistency between the currently most popular evaluation criteria for segmentation (e.g., F1 score, IoU) and human perception: while some figures were rated significantly lower in F1 score or IoU, they were "perceived" as better by humans (Fig. 4). Such inconsistency motivated us to propose a human-perception-based criterion, Perceptual Hausdorff Distance (PHD), to evaluate segmentation quality. Further, we set up a subjective experiment to collect human judgments of membrane segmentation and found that the PHD criterion is more consistent with human choices than traditional evaluation criteria. Finally, we found that the current popular and classic segmentation methods need to be revisited under the PHD criterion.
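The inconsistency above can be made concrete with a toy example (ours, not from the paper): a one-pixel-wide membrane predicted with a small spatial offset shares no pixels with the ground truth, so pixel-overlap criteria such as F1 collapse to zero even though a human would judge the two masks nearly identical. A minimal sketch in NumPy:

```python
import numpy as np

def f1_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-wise F1 between two binary masks (True = membrane)."""
    tp = np.logical_and(pred, gt).sum()
    if pred.sum() + gt.sum() == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return 2.0 * tp / (pred.sum() + gt.sum())

# Ground truth: a one-pixel-wide vertical membrane.
gt = np.zeros((64, 64), dtype=bool)
gt[:, 32] = True

# Prediction: the same membrane, shifted right by two pixels.
pred = np.zeros((64, 64), dtype=bool)
pred[:, 34] = True

print(f1_score(pred, gt))  # 0.0 -- no pixel overlap at all
```

IoU behaves the same way on this pair (zero intersection), which is why thin-structure segmentation is so unforgiving under overlap-based criteria.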
Overall, our contribution in this work lies mainly in two parts: (1) we established the largest EM dataset at original image resolution for training and testing; (2) we proposed a human-perception-based evaluation criterion, PHD, and verified its superiority through subjective experiments. The dataset we contributed and the PHD criterion we proposed may help researchers gain insights into the difference between human perception and conventional evaluation criteria, and thus motivate the design of segmentation methods that catch up with human performance on original EM images.
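For readers unfamiliar with it, the classic symmetric Hausdorff distance that PHD builds on measures how far apart two sets of boundary points are; the exact PHD definition, with its perceptual adjustments, is given later in the paper. A brute-force sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two binary masks,
    computed over the coordinates of their foreground pixels."""
    pa = np.argwhere(a)  # (N, 2) foreground pixel coordinates
    pb = np.argwhere(b)  # (M, 2)
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    # max over each set of the distance to its nearest counterpart.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Two one-pixel-wide membranes, offset by two pixels.
gt = np.zeros((64, 64), dtype=bool)
gt[:, 32] = True
pred = np.zeros((64, 64), dtype=bool)
pred[:, 34] = True

print(hausdorff(pred, gt))  # 2.0 -- reports the offset itself
```

Unlike F1 or IoU, which score this pair as a total failure, the distance-based view reports the small spatial offset directly, which is closer to how humans judge such results. (The brute-force pairwise matrix is O(NM); production code would use a KD-tree or distance transform.)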

2. U-RISC: ULTRA-HIGH RESOLUTION IMAGE SEGMENTATION DATASET FOR CELL MEMBRANE

Supervised learning methods rely heavily on high-quality datasets. To alleviate the lack of training data for cell membrane segmentation, we propose an Ultra-high Resolution Image Segmentation dataset for the Cell membrane, called U-RISC. The dataset was annotated upon RC1, a large-scale retinal serial-section transmission electron microscopy (ssTEM) dataset, publicly available upon request and described in the work of Anderson et al. (2011). The original RC1 dataset is a 0.25 mm diameter volume of 370 TEM slices, spanning the inner nuclear, inner plexiform, and ganglion cell layers, acquired at 2.18 nm/pixel across both axes with 70 nm slice thickness along the z-axis. From the 370-section volume, we clipped out 120 images of 10000 × 10000 pixels from randomly chosen sections. Then, we manually annotated the cell membranes in an iterative annotation-correction procedure. Since the human labeling process is valuable for uncovering how humans learn this task, we preserved the intermediate results of each labeling round for public release. The U-RISC dataset will be released on https://Anonymous.com upon acceptance.
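As a quick sanity check on the physical scale implied by the numbers above (simple arithmetic, not from the paper):

```python
# Physical footprint of one U-RISC image, from the stated
# in-plane resolution (2.18 nm/pixel) and image size (10000 px).
nm_per_pixel = 2.18
side_pixels = 10_000

side_um = side_pixels * nm_per_pixel / 1000.0  # nm -> micrometres
print(f"{side_um:.1f} um per side")  # 21.8 um per side
```

Each image therefore covers roughly 21.8 × 21.8 µm of tissue, versus approximately 2 × 2 µm for a 512 × 512 ISBI 2012 image at 4 nm/pixel, i.e. over a hundred times the area per image.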





COMPARISON WITH OTHER DATASETS

ISBI 2012 (Cardona et al. (2010)) published a set of 30 training images captured from the ventral nerve cord of a Drosophila first instar larva at a resolution of 4×4×50 nm/pixel through ssTEM (Arganda-Carreras et al. (2015b); ISBI 2012 (2012)). Each image contains 512×512 pixels, spanning a physical area of approximately 2×2 µm. In the challenge of SNEMI3D (Kasthuri

