CALIBRATED ADVERSARIAL REFINEMENT FOR STOCHASTIC SEMANTIC SEGMENTATION

Abstract

Ambiguities in images or unsystematic annotation can lead to multiple valid solutions in semantic segmentation. To learn a distribution over predictions, recent work has explored the use of probabilistic networks. However, these do not necessarily capture the empirical distribution accurately. In this work, we aim to learn a multimodal predictive distribution, where the empirical frequency of the sampled predictions closely reflects that of the corresponding labels in the training set. To this end, we propose a novel two-stage, cascaded strategy for calibrated adversarial refinement. In the first stage, we explicitly model the data with a categorical likelihood. In the second, we train an adversarial network to sample an arbitrary number of coherent predictions from it. The model can be used independently or integrated into any black-box segmentation framework to facilitate learning of calibrated stochastic mappings. We demonstrate the utility and versatility of the approach by attaining state-of-the-art results on the multi-grader LIDC dataset and a modified Cityscapes dataset. In addition, we use a toy regression dataset to show that our framework is not confined to semantic segmentation, and that the core design can be adapted to other tasks requiring a calibrated predictive distribution.

1. INTRODUCTION

Real-world datasets are often riddled with ambiguities, allowing for multiple valid solutions for a given input. These can emanate from an array of sources, such as an ambiguous label space (Lee et al., 2016), sensor noise, occlusions, and inconsistencies or errors during manual data annotation. Despite this problem, the majority of the research encompassing semantic segmentation focuses on optimising models that assign a single prediction to each input image (Ronneberger et al., 2015; Jégou et al., 2017; Takikawa et al., 2019; Chen et al., 2017a;b; 2016a;b; 2015). These are often incapable of capturing the entire empirical distribution of outputs. Moreover, since they optimise for a one-fits-all solution, noisy labels can lead to incoherent predictions and therefore compromise their reliability (Lee et al., 2016). Ideally, in such situations one would use a model that can sample multiple consistent hypotheses, capturing the different modalities of the ground truth distribution, and leverage uncertainty information to identify potential errors in each. Further, the sampled predictions should accurately reflect the occurrence frequencies of the labels in the training set; that is, the predictive distribution should be calibrated (Guo et al., 2017; Kull et al., 2019). Such a system would be particularly useful for hypothesis-driven reasoning in human-in-the-loop semi-automatic settings. For instance, large-scale manual annotation of segmentation maps is very labour-intensive: each label in the Cityscapes dataset takes on average 1.5 hours to annotate (Cordts et al., 2016). Alternatively, having a human operator manually select from a set of automatically generated label proposals could accelerate this process dramatically. In addition, combining uncertainty estimates with sampling of self-consistent labels can be used to focus the annotator's attention on ambiguous regions, where errors are likely to occur, thereby improving safety.
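To make the notion of a calibrated predictive distribution concrete, the following minimal sketch (not part of the proposed method; the toy labels, class count, and the mocked sampler are illustrative assumptions) simulates a single ambiguous pixel annotated by several graders, and measures how closely the empirical frequencies of a model's sampled predictions match the empirical label frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single ambiguous pixel: four graders disagree, so the
# empirical label distribution over classes {0, 1} is [0.75, 0.25].
grader_labels = np.array([0, 0, 0, 1])
label_freq = np.bincount(grader_labels, minlength=2) / len(grader_labels)

# A calibrated stochastic model should produce samples whose empirical
# frequencies match label_freq; here we mock such a model directly.
def calibrated_model_sample(n_samples):
    return rng.choice(2, size=n_samples, p=label_freq)

samples = calibrated_model_sample(10_000)
sample_freq = np.bincount(samples, minlength=2) / len(samples)

# Calibration error at this pixel: total variation distance between the
# sampled-prediction frequencies and the empirical label frequencies.
tv_distance = 0.5 * np.abs(sample_freq - label_freq).sum()
print(label_freq, sample_freq, tv_distance)
```

A miscalibrated model (e.g. one that always predicts the majority class) would drive `sample_freq` to `[1.0, 0.0]` and the distance to 0.25 here, even though every individual sample looks plausible.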
Several approaches have been proposed to capture label multimodality in image-to-image translation tasks (Huang et al., 2018; Lee et al., 2018; Zhu et al., 2017a; Bao et al., 2017; Zhang, 2018), with only a few of them applied to stochastic semantic segmentation (Kohl et al., 2018; 2019; Baumgartner et al., 2019; Hu et al., 2019; Kamnitsas et al., 2017; Rupprecht et al., 2017; Bhattacharyya et al., 2018). These methods have the capacity to learn a diverse set of labels for each input; however, they

