EFFICIENTLY TROUBLESHOOTING IMAGE SEGMENTATION MODELS WITH HUMAN-IN-THE-LOOP

Anonymous

Abstract

Image segmentation lays the foundation for many high-stakes vision applications such as autonomous driving and medical image analysis. It is, therefore, of great importance to not only improve the accuracy of segmentation models on well-established benchmarks, but also enhance their robustness in the real world, so as to avoid sparse but fatal failures. In this paper, instead of chasing state-of-the-art performance on existing benchmarks, we turn our attention to a new challenging problem: how to efficiently expose failures of "top-performing" segmentation models in the real world, and how to leverage such counterexamples to rectify the models. To achieve this with minimal human labelling effort, we first automatically sample a small set of images that are likely to falsify the target model from a large corpus of web images via the maximum discrepancy competition principle. We then propose a weak labelling strategy to further reduce the number of false positives, before time-consuming pixel-level labelling by humans. Finally, we fine-tune the model to harness the identified failures, and repeat the whole process, resulting in an efficient and progressive framework for troubleshooting segmentation models. We demonstrate the feasibility of our framework on the semantic segmentation task of PASCAL VOC, and find that the fine-tuned model exhibits significantly improved generalization when applied to real-world images with greater content diversity. All experimental code will be publicly released upon acceptance.

1. INTRODUCTION

Image segmentation (i.e., pixel-level image labelling) has recently risen to explosive popularity, due in part to its profound impact on many high-stakes vision applications, such as autonomous driving and medical image analysis. While the performance of segmentation models, as measured on excessively reused test sets (Everingham et al., 2010; Lin et al., 2014), keeps improving (Chen et al., 2018a; Badrinarayanan et al., 2017; Yu et al., 2018), two scientific questions have arisen to capture the community's curiosity, and motivate the current work:

Q1: Do "top-performing" segmentation models on existing benchmarks generalize to the real world, with its much richer variations?

Q2: Can we identify and rectify a trained model's sparse but fatal mistakes, without incurring a significant human labelling workload?

The answer to the first question is conceptually clearer, taking reference from a series of recent works on image classification (Recht et al., 2019; Hendrycks et al., 2019). A typical test set for image classification can only include at most tens of thousands of images, because human labelling (or verification of predicted labels) is expensive and time-consuming. Considering the high dimensionality of the image space and the "human-level" performance of existing methods, such test sets may only spot an extremely small subset of the mistakes a model will make, suggesting their insufficiency to cover hard examples that may be encountered in the real world (Wang et al., 2020). The existence of natural adversarial examples (Hendrycks et al., 2019) also echoes this hidden fragility of classifiers to unseen examples, despite their impressive accuracy on existing benchmarks.

While the above problem has not been studied in the context of image segmentation, we argue that it is only amplified there, for two main reasons. First, segmentation benchmarks require dense pixel-level annotation. Compared to classification databases, they are much more expensive, laborious, and error-prone to label 1, making existing segmentation datasets even more restricted in scale. Second, it is much harder for segmentation data to be class-balanced at the pixel level, making highly skewed class distributions notoriously common for this particular task (Kervadec et al., 2019; Bischke et al., 2018). Besides, the "universal" background class (often set to cover distracting or uninteresting classes (Everingham et al., 2010)) adds further complexity to image segmentation (Mostajabi et al., 2015). Thus, it remains questionable to what extent the impressive performance on existing benchmarks can be interpreted as (or translated into) real-world robustness. If "top-performing" segmentation models make sparse yet catastrophic mistakes that have not been spotted beforehand, they will fall short of the needs of high-stakes applications.

The answer to the second question constitutes the main body of our technical work. To identify sparse failures of existing segmentation models, it is necessary to expose them to a much larger corpus of labelled real-world images (on the order of millions or even billions). This is, however, implausible due to the expense of dense labelling in image segmentation. The core question essentially boils down to: how to efficiently decide what to label from the massive pool of unlabelled images, such that a small number of annotated images maximally expose corner-case defects and can be leveraged to improve the models.



In this paper, we introduce a two-stage framework with human-in-the-loop for efficiently troubleshooting image segmentation models (see Figure 1). The first stage automatically mines, from a large pool D of unlabelled images, a small image set M that is the most informative in exposing weaknesses of the target model. Specifically, inspired by previous studies on model falsification as model comparison (Wang & Simoncelli, 2008; Ma et al., 2018; Wang et al., 2020), we let the target model compete with a set of state-of-the-art methods with different design methodologies, and sample images by MAximizing the Discrepancy (MAD) between the methods. To reduce the number of false positives, we propose a weak labelling method of filtering M to obtain a smaller refined set S, subject to segmentation by human subjects. In the second stage, we fine-tune the target model to learn from the counterexamples in S without forgetting previously seen data. The two stages may be iterated, enabling progressive troubleshooting of image segmentation models. Experiments on PASCAL VOC (Everingham et al., 2010) demonstrate the feasibility of the proposed method on this new challenging problem, where we successfully discover corner-case errors of a "top-performing" segmentation model (Chen et al., 2017), and fix them for improved generalization in the wild.

2. RELATED WORK

MAD competition. The proposed method takes inspiration from the MAD competition (Wang & Simoncelli, 2008; Wang et al., 2020) to efficiently spot model failures. Previous works focused on performance evaluation; we take one step further and also fix the model errors detected in the MAD competition.
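To make the MAD-based sampling concrete, the following is a minimal sketch, not our actual implementation: `model_a` and `model_b` stand in for the target model and one competing model, each assumed to map an image to a per-pixel label map, and the discrepancy measure is simplified to the pixel-wise disagreement rate between the two predictions.

```python
import numpy as np

def discrepancy(pred_a, pred_b):
    """Fraction of pixels on which two segmentation maps disagree.
    A simple stand-in for a discrepancy measure between model outputs."""
    return float(np.mean(pred_a != pred_b))

def mad_sample(pool, model_a, model_b, k):
    """Select from an unlabelled pool the k images on which the two
    models disagree most, i.e., the MAD-style candidates for labelling."""
    scores = [discrepancy(model_a(x), model_b(x)) for x in pool]
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return [pool[i] for i in top]
```

In practice the selected set would then pass through the weak labelling filter before any pixel-level human annotation, and the discrepancy measure could be replaced by, e.g., one minus the mIoU between the two predicted maps.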
To the best of our knowledge, our work is the first to extend the MAD idea to image segmentation, where labelling efficiency is even more desirable, since pixel-wise human annotation for image segmentation is much more time-consuming than the image quality assessment (Wang & Simoncelli, 2008) and image classification (Wang et al., 2020) tasks previously explored.

1 According to Everingham et al. (2010) and our own practice, it can easily take ten times as long to segment an object as to draw a bounding box around it.



Figure 1: Proposed framework for troubleshooting segmentation models.

