MODELING MULTIMODAL ALEATORIC UNCERTAINTY IN SEGMENTATION WITH MIXTURE OF STOCHASTIC EXPERTS

Abstract

Equipping predicted segmentations with calibrated uncertainty is essential for safety-critical applications. In this work, we focus on capturing the data-inherent uncertainty (also known as aleatoric uncertainty) in segmentation, which typically arises when ambiguities exist in input images. Due to the high-dimensional output space and the potential for multiple modes in segmenting ambiguous images, it remains challenging to predict well-calibrated uncertainty for segmentation. To tackle this problem, we propose a novel mixture of stochastic experts (MoSE) model, where each expert network estimates a distinct mode of the aleatoric uncertainty and a gating network predicts the probabilities of an input image being segmented in those modes. This yields an efficient two-level uncertainty representation. To learn the model, we develop a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE prediction and the ground truth annotations. The loss can easily integrate traditional segmentation quality measures and be efficiently optimized via constraint relaxation. We validate our method on the LIDC-IDRI dataset and a modified multimodal Cityscapes dataset. Results demonstrate that our method achieves state-of-the-art or competitive performance on all metrics.
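The two-level representation described above can be sketched in a few lines: a gating network assigns probabilities to K modes, and each expert produces a per-pixel segmentation map for its mode. The following is a minimal NumPy sketch of only the output-side computation; the names (`mose_forward`, `gating_logits`, `expert_logits`) are ours for illustration and do not reflect the paper's actual network architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mose_forward(gating_logits, expert_logits):
    """Two-level uncertainty representation (sketch):
    - level 1: gating probabilities over K segmentation modes,
    - level 2: a per-pixel foreground probability map from each expert."""
    pi = softmax(gating_logits)                    # (K,) mode probabilities
    masks = 1.0 / (1.0 + np.exp(-expert_logits))   # (K, H, W) per-expert maps
    return pi, masks

# Toy forward pass with K=3 experts on an 8x8 image.
rng = np.random.default_rng(0)
K, H, W = 3, 8, 8
pi, masks = mose_forward(rng.normal(size=K), rng.normal(size=(K, H, W)))
```

Sampling a segmentation then amounts to first drawing a mode index from `pi` and then thresholding (or sampling from) that expert's map, which is what makes the representation compact compared to modeling a full joint pixel distribution.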

1. INTRODUCTION

Semantic segmentation, a core task in computer vision, has made significant progress thanks to the powerful representations learned by deep neural networks. The majority of existing works focus on generating a single (or sometimes a fixed number of) segmentation output(s) to achieve high pixel-wise accuracy (Minaee et al., 2021). Such a prediction strategy, while useful in many scenarios, typically disregards the predictive uncertainty in the segmentation outputs, even when the input image may contain ambiguous regions. Equipping predicted segmentations with calibrated uncertainty, however, is essential for many safety-critical applications such as medical diagnostics and autonomous driving, in order to prevent problematic low-confidence decisions (Amodei et al., 2016). An important problem in modeling predictive uncertainty for semantic segmentation is capturing aleatoric uncertainty, i.e., predicting multiple possible segmentation outcomes with calibrated probabilities when ambiguities exist in input images (Monteiro et al., 2020). In lung nodule segmentation, for example, an ambiguous image can be annotated either with a large nodule mask or as non-nodule, with different probabilities (Armato III et al., 2011). Such a problem can be naturally formulated as label distribution learning (Geng & Ji, 2013), whose goal is to estimate the conditional distribution of segmentations given an input image. Nonetheless, due to the high-dimensional output space, the typically multimodal characteristic of the distributions, and limited annotations, it remains challenging to predict well-calibrated uncertainty for segmentation.
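To make the "distribution distance" idea from the abstract concrete, one can view both the model prediction and the set of annotations as small discrete distributions over segmentation masks and compare them with exact optimal transport. The toy sketch below solves the transport problem as a linear program with a ground cost of 1 - IoU between masks; this is only an illustration of the general principle, not the paper's Wasserstein-like loss or its constraint relaxation, and the names (`ot_distance`, `iou`) are ours.

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(a, b, C):
    """Exact optimal transport cost between discrete distributions
    a (m,) and b (n,) under ground cost C (m, n), solved as an LP."""
    m, n = C.shape
    # Equality constraints: row sums of the plan equal a, column sums equal b.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum constraint for prediction i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column-sum constraint for annotation j
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

def iou(x, y):
    inter = np.logical_and(x, y).sum()
    union = np.logical_or(x, y).sum()
    return inter / union if union else 1.0  # two empty masks match perfectly

# Toy example: two predicted modes vs. two annotation modes on 2x2 masks,
# ground cost = 1 - IoU (a traditional segmentation quality measure).
preds = [np.array([[1, 1], [0, 0]]), np.array([[0, 0], [0, 0]])]
annos = [np.array([[1, 1], [0, 0]]), np.array([[0, 0], [0, 0]])]
C = np.array([[1 - iou(p, a) for a in annos] for p in preds])
d = ot_distance(np.array([0.6, 0.4]), np.array([0.5, 0.5]), C)  # ~0.1
```

Here each predicted mode has a zero-cost match among the annotations, so the transport cost only reflects the 0.1 of gating mass that must move to a mismatched annotation, illustrating how such a distance penalizes miscalibrated mode probabilities as well as poor masks.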
Most previous works (Kohl et al., 2018; 2019; Baumgartner et al., 2019; Hu et al., 2019) adopt the conditional variational autoencoder (cVAE) framework (Sohn et al., 2015) to learn the predictive distribution of segmentation outputs, which has a limited capability to capture the multimodal distribution due to the over-regularization from the Gaussian prior or posterior collapse (Razavi et al., 
1 Code is available at https://github.com/gaozhitong/MoSE-AUSeg.

