CUT OUT THE ANNOTATOR, KEEP THE CUTOUT: BETTER SEGMENTATION WITH WEAK SUPERVISION

Abstract

Constructing large, labeled training datasets for segmentation models is an expensive and labor-intensive process. This is a common challenge in machine learning, addressed by methods that require few or no labeled data points, such as few-shot learning (FSL) and weakly-supervised learning (WS). Such techniques, however, have limitations when applied to image segmentation: FSL methods often produce noisy results and are strongly dependent on which few data points are labeled, while WS models struggle to fully exploit rich image information. We propose a framework that fuses FSL and WS for segmentation tasks, enabling users to train high-performing segmentation networks with very few hand-labeled training points. We use FSL models as weak sources in a WS framework, requiring only a very small set of labeled reference images, and introduce a new WS model that fuses these weak sources by focusing on key areas of the image, namely those where the noisy labels disagree. Empirically, we evaluate our proposed approach on seven well-motivated segmentation tasks. We show that our methods can reach within 1.4 Dice points of fully supervised networks while requiring only five hand-labeled training points. Compared to existing FSL methods, our approach improves performance by a mean of 3.6 Dice points over the next-best method.
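To make the fusion idea concrete, the following is a minimal sketch of combining binary masks from several weak FSL sources by per-pixel voting, while flagging contentious pixels (those without clear consensus) for special treatment. The function name, thresholds, and voting rule are illustrative assumptions, not the paper's actual WS model.

```python
import numpy as np

def fuse_weak_masks(masks, low=0.25, high=0.75):
    """Fuse binary masks from K weak sources by per-pixel voting.

    masks: array of shape (K, H, W) with values in {0, 1}.
    Returns (fused, contention): a majority-vote mask and a boolean
    mask of contentious pixels where the sources disagree.
    """
    masks = np.asarray(masks)
    vote = masks.mean(axis=0)                  # fraction voting foreground
    fused = (vote >= 0.5).astype(np.uint8)     # simple majority vote
    contention = (vote > low) & (vote < high)  # no clear consensus
    return fused, contention

# Three toy 2x2 "weak" masks from hypothetical FSL sources
m = np.array([[[1, 0], [1, 1]],
              [[1, 0], [0, 1]],
              [[1, 1], [0, 1]]])
fused, contention = fuse_weak_masks(m)
# fused      -> [[1, 0], [0, 1]]
# contention -> [[False, True], [True, False]]
```

In the paper's framework, a learned WS model would replace this naive majority vote on the contentious pixels; the sketch only illustrates where such a model has room to improve over simple fusion.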

1. INTRODUCTION

Automated image segmentation has seen rapid improvements with recent developments in deep learning (Li et al., 2018; Chen et al., 2017; Milletari et al., 2016). Convolutional neural networks (CNNs) achieve high segmentation performance but can require large, labeled training datasets. Acquiring training labels is laborious and slow, particularly for medical images, where expert segmentation is often required in three or four dimensions. While large datasets and pre-trained networks exist for natural images, medical image segmentation is a more targeted task, typically requiring new training sets for every imaging modality, scanner type, anatomical structure, and patient population. Such difficulties abound, but the significant impact of improved medical image segmentation motivates tackling these challenges (Hesamian et al., 2019).

Many few-shot learning (FSL) approaches have been proposed to mitigate these difficulties by training networks using only a few labeled examples. For example, data augmentation can reduce the needed amount of labeled data by introducing additional variation into small, labeled training sets through operations such as affine transforms or learned transformations (e.g., deformations learned by GANs) (Zhao et al., 2019; Eaton-Rosen et al., 2018; Çiçek et al., 2016). Many semi-supervised approaches aim to learn a useful representation from unlabeled data and then fine-tune on a few manually annotated images (Chen et al., 2020; Bai et al., 2019; Chen et al., 2019; Chaitanya et al., 2020a). Finally, other approaches aim to transfer knowledge learned from segmenting one class in order to segment a previously unseen class (Shaban et al., 2017; Rakelly et al., 2018; Roy et al., 2020; Ouyang et al., 2020). However, these FSL approaches have limitations. For example, effective data augmentation often requires model- and task-specific transformations that are difficult

