MODELING MULTIMODAL ALEATORIC UNCERTAINTY IN SEGMENTATION WITH MIXTURE OF STOCHASTIC EXPERTS

Abstract

Equipping predicted segmentations with calibrated uncertainty is essential for safety-critical applications. In this work, we focus on capturing the data-inherent uncertainty (a.k.a. aleatoric uncertainty) in segmentation, which typically arises when ambiguities exist in input images. Due to the high-dimensional output space and the potential for multiple modes in segmenting ambiguous images, it remains challenging to predict well-calibrated uncertainty for segmentation. To tackle this problem, we propose a novel mixture of stochastic experts (MoSE) model, where each expert network estimates a distinct mode of the aleatoric uncertainty and a gating network predicts the probabilities of an input image being segmented in those modes. This yields an efficient two-level uncertainty representation. To learn the model, we develop a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE output and the ground-truth annotations. The loss easily integrates traditional segmentation quality measures and can be efficiently optimized via constraint relaxation. We validate our method on the LIDC-IDRI dataset and a modified multimodal Cityscapes dataset. Results demonstrate that our method achieves state-of-the-art or competitive performance on all metrics.¹

1. INTRODUCTION

Semantic segmentation, a core task in computer vision, has made significant progress thanks to the powerful representations learned by deep neural networks. Most existing work focuses on generating a single (or sometimes a fixed number of) segmentation output(s) to achieve high pixel-wise accuracy (Minaee et al., 2021). Such a prediction strategy, while useful in many scenarios, typically disregards the predictive uncertainty in the segmentation outputs, even when the input image may contain ambiguous regions. Equipping predicted segmentations with calibrated uncertainty, however, is essential for many safety-critical applications such as medical diagnostics and autonomous driving to prevent problematic low-confidence decisions (Amodei et al., 2016). An important problem in modeling predictive uncertainty for semantic segmentation is to capture aleatoric uncertainty, i.e., to predict multiple possible segmentation outcomes with calibrated probabilities when ambiguities exist in the input images (Monteiro et al., 2020). In lung nodule segmentation, for example, an ambiguous image can be annotated either with a large nodule mask or as non-nodule, each with different probabilities (Armato III et al., 2011). Such a problem can be naturally formulated as label distribution learning (Geng & Ji, 2013), whose goal is to estimate the conditional distribution of segmentations given an input image. Nonetheless, due to the high-dimensional output space, the typically multimodal nature of these distributions, and limited annotations, it remains challenging to predict well-calibrated uncertainty for segmentation.
Most previous works (Kohl et al., 2018; 2019; Baumgartner et al., 2019; Hu et al., 2019) adopt the conditional variational autoencoder (cVAE) framework (Sohn et al., 2015) to learn the predictive distribution of segmentation outputs, which has a limited capability to capture the multimodal distribution due to the over-regularization from the Gaussian prior or posterior collapse (Razavi et al., 2018; Qiu & Lui, 2021) . Recently, Monteiro et al. (2020) propose a multivariate Gaussian model in the logit space of pixel-wise classifiers. However, it has to use a low-rank covariance matrix for computational efficiency, which imposes a restriction on its modeling capacity. To alleviate this problem, Kassapis et al. (2021) employ the adversarial learning strategy to learn an implicit density model, which requires additional loss terms to improve training stability (Luc et al., 2016; Samson et al., 2019) . Moreover, all those methods have to sample a large number of segmentation outputs to represent the predictive distribution, which can be inefficient in practice. In this work, we aim to tackle the aforementioned limitations by explicitly modeling the multimodal characteristic of the segmentation distribution. To this end, we propose a novel mixture of stochastic experts (MoSE) (Masoudnia & Ebrahimpour, 2014) framework, in which each expert network encodes a distinct mode of aleatoric uncertainty in the dataset and its weight represents the probability of an input image being annotated by a segmentation sampled from that mode. This enables us to decompose the output distribution into two granularity levels and hence provides an efficient and more interpretable representation for the uncertainty. Moreover, we formulate the model learning as an Optimal Transport (OT) problem (Villani, 2009) , and design a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE and ground truth annotations. 
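To make the learning objective concrete, the sketch below illustrates one simple way to relax a discrete optimal-transport objective between predicted segmentation samples and ground-truth annotations. It is only a one-sided relaxation (each annotation routes its probability mass to the closest prediction), not the paper's actual bi-level procedure; the function names, the IoU-based cost, and the binary-mask setting are illustrative assumptions.

```python
import numpy as np

def iou_cost(a, b):
    """Segmentation cost between two binary masks: 1 - IoU."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union > 0 else 1.0)

def relaxed_ot_loss(pred_samples, gt_masks, gt_weights):
    """One-sided relaxation of the OT objective: each ground-truth
    annotation sends all of its mass to its nearest predicted sample.
    This upper-bounds the exact Wasserstein distance between the two
    discrete distributions (it ignores the predicted marginal)."""
    cost = np.array([[iou_cost(p, g) for p in pred_samples] for g in gt_masks])
    return float(np.sum(np.asarray(gt_weights) * cost.min(axis=1)))
```

Note that the full method also has to account for the predicted mode probabilities; this sketch only shows the cost-matching half of the objective.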
Specifically, our MoSE model comprises a set of stochastic expert networks and a gating network. Given an image, each expert computes a semantic representation via a shared segmentation network, which is fused with a latent Gaussian variable to generate segmentation samples of an individual mode. The gating network takes a semantic feature embedding of the image and predicts the mode probabilities. As such, the output distribution can be efficiently represented by a set of samples from the experts and their corresponding probabilities. To learn the model, we relax the original OT formulation and develop an efficient bi-level optimization procedure to minimize the loss. We validate our method on the LIDC-IDRI dataset (Armato III et al., 2011) and a modified multimodal Cityscapes dataset (Cordts et al., 2016; Kohl et al., 2018). Results demonstrate that our method achieves state-of-the-art or competitive performance on all metrics. To summarize, our main contributions are threefold: (i) We propose a novel mixture of stochastic experts model to capture aleatoric uncertainty in segmentation; (ii) We develop a Wasserstein-like loss and constraint relaxation for efficient learning of uncertainty; (iii) Our method achieves state-of-the-art or competitive performance on two challenging segmentation benchmarks.
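The two-level structure described above can be sketched in a toy form: a shared "segmentation network" maps image features to per-pixel logits, each expert perturbs them with its own Gaussian latent draw, and a gating head yields the mode probabilities. All networks are replaced by random linear maps here, and every class and parameter name is hypothetical; this is a minimal illustration of the sampling interface, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoSESketch:
    """Toy mixture of stochastic experts over binary segmentation masks."""

    def __init__(self, n_experts=3, feat_dim=8, n_pixels=16):
        self.gate = rng.normal(size=(n_experts, feat_dim))    # gating head
        self.shared = rng.normal(size=(n_pixels, feat_dim))   # shared seg. head
        self.fuse = rng.normal(size=(n_experts, n_pixels))    # per-expert fusion

    def forward(self, feat, n_samples=4):
        probs = softmax(self.gate @ feat)        # P(mode | image)
        base = self.shared @ feat                # shared semantic logits
        samples = []                             # per-expert mask samples
        for k in range(len(probs)):
            z = rng.normal(size=(n_samples, 1))  # latent draws for expert k
            samples.append(base + z * self.fuse[k] > 0)
        return samples, probs
```

The output distribution is then represented compactly by the per-expert sample sets together with the gating probabilities, rather than by one large pool of unweighted samples.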

2. RELATED WORK

Aleatoric uncertainty in semantic segmentation  Aleatoric uncertainty refers to the uncertainty inherent in the observations and cannot be explained away with more data (Hüllermeier & Waegeman, 2021). Early works mainly focus on the classification task, modeling such uncertainty by exploiting a Dirichlet prior (Malinin & Gales, 2018; 2019) or post-training with calibrated losses (Sensoy et al., 2018; Guo et al., 2017; Kull et al., 2019). In semantic segmentation, due to the high-dimensional structured output space, aleatoric uncertainty estimation typically follows two kinds of strategies. The first kind (Kendall & Gal, 2017; Wang et al., 2019; Ji et al., 2021) ignores the structural dependencies among pixels and may suffer from inconsistent estimation. In the second line of research, Kohl et al. (2018) first propose to learn a conditional generative model via the cVAE framework (Sohn et al., 2015; Kingma & Welling, 2014), which can produce a joint distribution over the structured outputs. It is then improved by Baumgartner et al. (2019) and Kohl et al. (2019) with hierarchical latent code injection. Other progress based on the cVAE framework attempts to introduce an additional calibration loss on sample diversity (Hu et al., 2019), discretize the latent code (Qiu & Lui, 2021), or augment the posterior density with normalizing flows (Valiuddin et al., 2021). Recently, Monteiro et al. (2020) directly model the joint distribution among pixels in the logit space with a multivariate Gaussian with low-rank parameterization, while Kassapis et al. (2021) adopt a cascaded framework and an adversarial loss for model learning. Different from those methods, we adopt an explicit multimodal framework trained with a Wasserstein-like loss, which enables us to capture the multimodal characteristic of the label space effectively.

Optimal Transport and Wasserstein distance  The optimal transport problem studies how to transfer the mass from one distribution to another in the most efficient way (Villani, 2009). The corresponding minimum transport cost is known as the Wasserstein distance or Earth Mover's distance in

¹ Code is available at https://github.com/gaozhitong/MoSE-AUSeg.
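For intuition, in one dimension the Wasserstein-1 distance between two discrete distributions admits a well-known closed form: the area between their cumulative distribution functions. The sketch below computes it with plain numpy (the function name is ours):

```python
import numpy as np

def wasserstein_1d(xs, p, q):
    """W1 between discrete distributions p and q on sorted support xs:
    the L1 gap between their CDFs, weighted by the spacing of xs."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(xs)))
```

For instance, moving a unit of mass from point 0 to point 1 costs exactly 1 under this distance. In high-dimensional structured spaces such as segmentation masks no such closed form exists, which is why the distance must instead be computed from a cost matrix over samples.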

