EFFICIENTLY TROUBLESHOOTING IMAGE SEGMENTATION MODELS WITH HUMAN-IN-THE-LOOP

Anonymous

Abstract

Image segmentation lays the foundation for many high-stakes vision applications such as autonomous driving and medical image analysis. It is, therefore, of great importance not only to improve the accuracy of segmentation models on well-established benchmarks, but also to enhance their robustness in the real world, so as to avoid sparse but fatal failures. In this paper, instead of chasing state-of-the-art performance on existing benchmarks, we turn our attention to a new and challenging problem: how to efficiently expose failures of "top-performing" segmentation models in the real world, and how to leverage such counterexamples to rectify the models. To achieve this with minimal human labelling effort, we first automatically sample a small set of images that are likely to falsify the target model from a large corpus of web images, via the maximum discrepancy competition principle. We then propose a weak labelling strategy to further reduce the number of false positives before time-consuming pixel-level labelling by humans. Finally, we fine-tune the model to harness the identified failures, and repeat the whole process, resulting in an efficient and progressive framework for troubleshooting segmentation models. We demonstrate the feasibility of our framework on the semantic segmentation task of PASCAL VOC, and find that the fine-tuned model exhibits significantly improved generalization when applied to real-world images with greater content diversity. All experimental code will be publicly released upon acceptance.
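The sampling step described above can be illustrated with a minimal sketch of maximum-discrepancy selection: given two segmentation models, rank unlabelled web images by how strongly the models' predicted label maps disagree, and keep the top candidates for human inspection. All names below (`discrepancy`, `mad_select`) are illustrative and not taken from the paper; the authors' actual selection criterion may differ.

```python
import numpy as np

def discrepancy(mask_a, mask_b):
    """Fraction of pixels on which two predicted label maps disagree."""
    return float(np.mean(mask_a != mask_b))

def mad_select(images, model_a, model_b, k):
    """Rank unlabelled images by the disagreement between two models'
    predictions and return the k images most likely to falsify at least
    one model (the maximum discrepancy competition principle)."""
    scores = [discrepancy(model_a(img), model_b(img)) for img in images]
    order = np.argsort(scores)[::-1]  # most disagreement first
    return [images[i] for i in order[:k]]
```

In practice the two models would be strong pretrained segmenters and the discrepancy measure could be, e.g., one minus the mIoU between their predictions; images on which both models agree are cheap to discard without labelling, while high-discrepancy images are forwarded to human annotators.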

1. INTRODUCTION

Image segmentation (i.e., pixel-level image labelling) has recently risen to explosive popularity, due in part to its profound impact on many high-stakes vision applications, such as autonomous driving and medical image analysis. While the performance of segmentation models, as measured on excessively reused test sets (Everingham et al., 2010; Lin et al., 2014), keeps improving (Chen et al., 2018a; Badrinarayanan et al., 2017; Yu et al., 2018), two scientific questions have arisen that capture the community's curiosity and motivate the current work:

Q1: Do "top-performing" segmentation models on existing benchmarks generalize to the real world, with its much richer variations?

Q2: Can we identify and rectify a trained model's sparse but fatal mistakes without incurring a significant human labelling workload?

The answer to the first question is conceptually clearer, by reference to a series of recent works on image classification (Recht et al., 2019; Hendrycks et al., 2019). A typical test set for image classification can include at most tens of thousands of images, because human labelling (or verification of predicted labels) is expensive and time-consuming. Considering the high dimensionality of image space and the "human-level" performance of existing methods, such test sets may spot only an extremely small subset of the mistakes a model can make, suggesting that they are insufficient to cover hard examples that may be encountered in the real world (Wang et al., 2020). The existence of natural adversarial examples (Hendrycks et al., 2019) also echoes this hidden fragility of classifiers on unseen examples, despite their impressive accuracy on existing benchmarks. While the above problem has not been studied in the context of image segmentation, we argue that it would only be amplified there, for two main reasons. First, segmentation benchmarks require

