CALIBRATED ADVERSARIAL REFINEMENT FOR STOCHASTIC SEMANTIC SEGMENTATION

Abstract

Ambiguities in images or unsystematic annotation can lead to multiple valid solutions in semantic segmentation. To learn a distribution over predictions, recent work has explored the use of probabilistic networks. However, these do not necessarily capture the empirical distribution accurately. In this work, we aim to learn a multimodal predictive distribution, where the empirical frequency of the sampled predictions closely reflects that of the corresponding labels in the training set. To this end, we propose a novel two-stage, cascaded strategy for calibrated adversarial refinement. In the first stage, we explicitly model the data with a categorical likelihood. In the second, we train an adversarial network to sample from it an arbitrary number of coherent predictions. The model can be used independently or integrated into any black-box segmentation framework to facilitate learning of calibrated stochastic mappings. We demonstrate the utility and versatility of the approach by attaining state-of-the-art results on the multigrader LIDC dataset and a modified Cityscapes dataset. In addition, we use a toy regression dataset to show that our framework is not confined to semantic segmentation, and the core design can be adapted to other tasks requiring learning a calibrated predictive distribution.

1. INTRODUCTION

Real-world datasets are often riddled with ambiguities, allowing for multiple valid solutions for a given input. These can emanate from an array of sources, such as an ambiguous label space (Lee et al., 2016), sensor noise, occlusions, and inconsistencies or errors during manual data annotation. Despite this, the majority of research on semantic segmentation focuses on optimising models that assign a single prediction to each input image (Ronneberger et al., 2015; Jégou et al., 2017; Takikawa et al., 2019; Chen et al., 2017a;b; 2016a;b; 2015). Such models are often incapable of capturing the full empirical distribution of outputs. Moreover, since they optimise for a one-fits-all solution, noisy labels can lead to incoherent predictions and thereby compromise their reliability (Lee et al., 2016). Ideally, in such situations one would use a model that can sample multiple consistent hypotheses, capturing the different modes of the ground-truth distribution, and leverage uncertainty information to identify potential errors in each. Further, the sampled predictions should accurately reflect the occurrence frequencies of the labels in the training set; that is, the predictive distribution should be calibrated (Guo et al., 2017; Kull et al., 2019).

Such a system would be particularly useful for hypothesis-driven reasoning in human-in-the-loop, semi-automatic settings. For instance, large-scale manual annotation of segmentation maps is very labour-intensive: each label in the Cityscapes dataset takes on average 1.5 hours to annotate (Cordts et al., 2016). Having a human operator select from a set of automatically generated label proposals could accelerate this process dramatically. In addition, combining uncertainty estimates with sampling of self-consistent labels can be used to focus the annotator's attention on ambiguous regions, where errors are likely to occur, thereby improving safety.
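To make the notion of a calibrated predictive distribution concrete, the sketch below (illustrative only; the function names are ours, not from any released code) compares the per-pixel class frequencies of a model's sampled predictions with those of multiple human annotations of the same image:

```python
import numpy as np

def pixelwise_frequencies(label_maps, num_classes):
    """Empirical per-pixel class frequencies over a set of label maps.

    label_maps: int array of shape (S, H, W) holding S sampled predictions
    (or S annotator labels) for the same image.
    Returns an array of shape (num_classes, H, W) of frequencies.
    """
    one_hot = np.eye(num_classes)[label_maps]          # (S, H, W, C)
    return one_hot.mean(axis=0).transpose(2, 0, 1)     # (C, H, W)

def calibration_gap(pred_samples, annotator_labels, num_classes):
    """Mean absolute difference between the class frequencies of the
    model's samples and those of the human annotations."""
    p = pixelwise_frequencies(pred_samples, num_classes)
    q = pixelwise_frequencies(annotator_labels, num_classes)
    return np.abs(p - q).mean()
```

A calibrated sampler drives this gap towards zero: where annotators split 50/50 between two classes, roughly half of the sampled predictions should pick each class at those pixels.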
Several approaches have been proposed to capture label multimodality in image-to-image translation tasks (Huang et al., 2018; Lee et al., 2018; Zhu et al., 2017a; Bao et al., 2017; Zhang, 2018), with only a few of them applied to stochastic semantic segmentation (Kohl et al., 2018; 2019; Baumgartner et al., 2019; Hu et al., 2019; Kamnitsas et al., 2017; Rupprecht et al., 2017; Bhattacharyya et al., 2018). These methods have the capacity to learn a diverse set of labels for each input; however, they are either limited to a fixed number of samples (Kamnitsas et al., 2017; Rupprecht et al., 2017), return uncalibrated predictions, or do not account for uncertainty. In this work, we tackle all three challenges by introducing a two-stage cascaded strategy: in the first stage we estimate pixelwise class probabilities, and in the second we sample confident predictions, calibrated relative to the distribution predicted in the first stage. This allows us to obtain both uncertainty estimates and self-consistent label proposals. The key contributions are¹:

• We propose a novel cascaded architecture that constructively combines explicit likelihood modelling with adversarial refinement to sample an arbitrary number of confident, self-consistent predictions given an input image.

• We introduce a novel loss term that facilitates learning of calibrated stochastic mappings when using adversarial neural networks. To our knowledge, this is the first work to do so.

• The proposed model can be trained independently or used to augment any pretrained black-box semantic segmentation model, endowing it with a multimodal predictive distribution.

¹ Code is publicly available at <URL OMITTED FOR ANONYMITY>

2. RELATED WORK

Bhattacharyya et al. (2018) identify the maximum likelihood learning objective as the cause of mode averaging in dropout Bayesian neural networks (Gal and Ghahramani, 2016b). They postulate that under cross-entropy optimisation, all sampled models are forced to explain all the data, and thereby converge to the mean solution. To counter this, they propose to replace the cross entropy with an adversarial loss term parametrising a synthetic likelihood (Rosca et al., 2017), thereby making the objective conducive to multimodality. In contrast to this method, our approach is simpler to implement, as it is not cast in the framework of variational Bayes, which requires the specification of weight priors and a variational distribution family.

Kohl et al. (2018) take an orthogonal approach, combining a U-Net (Ronneberger et al., 2015) with a conditional variational autoencoder (cVAE) (Kingma and Welling, 2013) to learn a distribution over semantic labels. In Kohl et al. (2019) and Baumgartner et al. (2019), the authors build on Kohl et al. (2018) to improve the diversity of the samples by modelling the data on several scales of the image resolution. Nonetheless, these methods do not explicitly calibrate the predictive distribution in the pixel space, and consequently do not provide reliable aleatoric uncertainty estimates (Kendall and Gal, 2017; Choi et al., 2018; Gustafsson et al., 2019). Hu et al. (2019) address this shortcoming by using the inter-grader variability as additional supervision. A major limitation of this approach is the requirement of a priori knowledge of all the modalities of the data distribution; for many real-world datasets, this information is not readily available.

In the more general domain of image-to-image translation, alternative methods employ hybrid models that use adversarially trained cVAEs (Zhu et al., 2017a; Bao et al., 2017) to learn a distribution over a latent code capturing multimodality, in order to sample diverse and coherent predictions. A common hurdle in conditional generative adversarial network (cGAN) approaches is that simply incorporating a noise vector as an additional input often results in mode collapse: because nothing regularises the relationship between the noise input and the generator output, the generator can learn to ignore the noise vector (Isola et al., 2017). This issue is commonly addressed with supplementary cycle-consistency losses (Huang et al., 2018; Lee et al., 2018; Zhu et al., 2017a; Bao et al., 2017), as proposed by Zhu et al. (2017b), or with alternative regularisation losses on the generator (Yang et al., 2018). However, none of these methods explicitly address the challenge of calibrating the predictive distribution.
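The two-stage cascade can be roughly illustrated as follows. This is a schematic with stand-in components (random "networks", made-up names), not the actual architecture: stage 1 outputs per-pixel categorical probabilities, and stage 2 maps those probabilities plus a noise vector to one coherent hard label map per sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_net(image, num_classes=3):
    """Stage 1 (stand-in): per-pixel categorical probabilities.
    In the real model this is a segmentation network trained with a
    categorical (cross-entropy) likelihood."""
    logits = rng.normal(size=(num_classes,) + image.shape)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)               # (C, H, W)

def refinement_net(probs, noise):
    """Stage 2 (stand-in): map stage-1 probabilities and a noise vector
    to a single coherent hard label map. The real refinement network is
    trained adversarially, with a loss tying the average of its samples
    back to the stage-1 distribution."""
    perturbed = np.log(probs + 1e-8) + noise[:, None, None]
    return perturbed.argmax(axis=0)                       # (H, W)

def sample_predictions(image, n_samples=4, num_classes=3):
    """Draw an arbitrary number of hard predictions for one image."""
    probs = calibration_net(image, num_classes)
    return probs, [refinement_net(probs, rng.normal(size=num_classes))
                   for _ in range(n_samples)]
```

Because every sample is a hard label map, each proposal is internally consistent, while the stage-1 probabilities retain the pixelwise (aleatoric) uncertainty.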


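One simple instance of the generator-side regularisation used against noise-ignoring mode collapse (a sketch in the spirit of Yang et al., 2018; the exact formulation in that work differs) penalises the generator when distinct noise vectors produce near-identical outputs:

```python
import numpy as np

def diversity_regulariser(gen, x, z1, z2, eps=1e-8):
    """Term to *minimise* alongside the cGAN loss: it is most negative
    when two noise vectors z1, z2 yield very different outputs, so a
    generator that ignores its noise input pays the maximal penalty."""
    out_dist = np.abs(gen(x, z1) - gen(x, z2)).mean()
    z_dist = np.abs(z1 - z2).mean()
    return -out_dist / (z_dist + eps)
```

For a toy generator gen(x, z) = x + z the term is close to -1 for unit-separated noise vectors, whereas a collapsed generator that discards z scores 0, the worst possible value.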