THAT LABEL'S GOT STYLE: HANDLING LABEL STYLE BIAS FOR UNCERTAIN IMAGE SEGMENTATION

Abstract

Segmentation uncertainty models predict a distribution over plausible segmentations for a given input, which they learn from the annotator variation in the training set. In practice, however, these annotations can differ systematically in the way they are generated, for example through the use of different labeling tools. This results in datasets that contain both data variability and differing label styles. In this paper, we demonstrate that applying state-of-the-art segmentation uncertainty models to such datasets can lead to model bias caused by the different label styles. We present an updated modelling objective for aleatoric uncertainty estimation that conditions on label style, and modify two state-of-the-art architectures for segmentation uncertainty accordingly. We show with extensive experiments that this method reduces label style bias while improving segmentation performance, increasing the applicability of segmentation uncertainty models in the wild. We curate two datasets, with annotations in different label styles, which we will make publicly available along with our code upon publication.

1. INTRODUCTION

Image segmentation is a fundamental task in computer vision and biomedical image processing. As part of the effort to create safe and interpretable ML systems, the quantification of segmentation uncertainty has thus become a crucial task as well. While different sources, and therefore different types, of uncertainty can be distinguished (Kiureghian & Ditlevsen, 2009; Gawlikowski et al., 2021), research has mainly focused on modelling two types: aleatoric and epistemic uncertainty. While epistemic uncertainty mainly refers to model uncertainty due to missing training data, aleatoric uncertainty arises from variability inherent in the data, caused for example by differing annotator opinions about the presence, position and boundary of an object. Since aleatoric uncertainty estimates are inferred directly from the training data, it is important that the variability in the available ground-truth annotations represents the experts' disagreement. In practice, however, annotations might vary in systematic ways, caused by differing labeling tools or different labeling instructions. Especially in settings where opinions in the form of annotations are sourced from experts at different institutions, datasets can be heterogeneous in the way the labels are generated. Even in the best of cases, in which manual annotators are given detailed instructions on how to segment objects, they still have to make choices on how to annotate ambiguous parts of the image. Moreover, annotators are not always carefully trained, and may not have access to the same labeling tools. As a result, individual choices and external factors affect how annotations are made; we term this label style.

Figure 1: Sample image from the PhC-U373 dataset with annotations (red). The first three annotators were instructed to delineate the boundary in detail, whereas the last three annotators were instructed to provide a coarser and faster annotation.
Figure 1 shows an example of how annotations may vary in label style. Label style can also depend on label cost: while detailed annotations are desirable, they also take more time, and one might prefer to train models on cheaper, less detailed annotations. In the example of Fig. 1, we have access to both detailed and coarse, or weak, annotations. It is not clear that adding the weaker annotations will necessarily improve performance; removing them to train on fewer but higher-quality annotations could also be beneficial. While weak annotations carry less precise information about the segmentation boundary, they do carry information about the annotator's beliefs concerning the presence and rough location of an object. Exploiting this information could improve the annotator distribution learned by the model, even though the target might not be delineated in detail. In practice, however, neither datasets nor models distinguish between variations in label style and variations in the data. As a result, current methods for segmentation uncertainty run the risk of being biased by this difference in label style.

1.1. CONTRIBUTION

In this paper, we demonstrate that applying state-of-the-art models to datasets that contain differing label styles can lead to systematic over-segmentation. We show how this bias can be reduced by stating an updated modelling objective for aleatoric uncertainty estimation conditioned on label style. We adjust two state-of-the-art uncertainty segmentation architectures accordingly, presenting conditioned versions of the Probabilistic U-net (Kohl et al., 2018) and the Stochastic Segmentation Networks (Monteiro et al., 2020) that fit the updated modelling objective and can be trained on datasets containing differing label styles. We compare the proposed method against the common strategy of removing the annotations of a weaker label style from the dataset. We curate two datasets, both with annotations in different label styles, ranging from detailed, close crops to over-segmented outlines. In a series of experiments, we show that the conditioned models outperform standard models trained on either all label styles or a single one. The conditioning reduces label style bias, improves overall segmentation accuracy, and enables more precise flagging of probable segmentation errors. Our results stress that including all label styles via a conditioned model fully leverages all labels in a dataset, as opposed to naively excluding weaker label styles. As such, our model contributes to increasing the applicability of uncertainty segmentation models in practice. Our code and curated datasets will be made publicly available, to enable the community to further assess models for segmentation uncertainty in the scenario of differing label styles.
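To make the idea of conditioning on label style concrete, the sketch below shows one simple, hypothetical conditioning scheme: appending a one-hot label-style code to the network input as extra channels. This is only an illustration of the general principle; the conditioned architectures presented in this paper inject the style code into the models themselves (e.g., into the latent space of the Probabilistic U-net), and the function name and interface here are our own.

```python
import numpy as np

def add_style_channels(image, style_id, num_styles):
    """Condition a segmentation input on label style by appending
    one-hot style planes as extra input channels.

    This is a hypothetical, minimal conditioning scheme for
    illustration only; the paper's conditioned architectures use a
    different mechanism.

    image: array of shape (C, H, W); style_id: integer in [0, num_styles).
    Returns an array of shape (C + num_styles, H, W).
    """
    c, h, w = image.shape
    planes = np.zeros((num_styles, h, w), dtype=image.dtype)
    planes[style_id] = 1.0  # mark the active label style
    return np.concatenate([image, planes], axis=0)
```

At training time, each annotation would be paired with the style code of its annotator; at test time, the style code selects which label style the model should emulate.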

2. BACKGROUND AND RELATED WORK

Uncertainties in deep learning in general, and in image segmentation in particular, can be studied under the Bayesian framework (Bishop, 2006; Kendall & Gal, 2017). Let D = (X, A) be a dataset of N images x_n ∈ X with S pixels each, where each image x_n is associated with k ground-truth annotations a_n^k ∈ A, drawn from the unknown annotator distribution p(a|x_n). Furthermore, let f(x, θ) denote a model of p(a|x) defined by parameters θ. Formulating the segmentation task in a Bayesian way, we seek a predictive distribution p(y|x) over model predictions y given an image x that is as similar as possible to the annotator distribution p(a|x). This predictive distribution can be decomposed into the two types of uncertainty (Kiureghian & Ditlevsen, 2009):

p(y|x) = ∫ p(y|x, θ) p(θ|D) dθ.   (1)

After observing the data D during training, the posterior distribution p(θ|D) describes a density over the parameter space of the model, capturing epistemic uncertainty. The distribution p(y|x, θ), on the other hand, captures the variation in the data and possible model predictions, i.e., aleatoric uncertainty. Because the posterior p(θ|D) is typically intractable, the integral on the right-hand side of equation 1 is usually not accessible. It is therefore of particular interest to develop suitable approximations of the predictive distribution, or of parts of the integral in equation 1, and various image segmentation approaches and models have been proposed for this purpose (Kohl et al., 2018; Monteiro et al., 2020; Kohl et al., 2019; Baumgartner et al., 2019).
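A standard way to approximate the intractable integral in equation 1 is Monte Carlo estimation: draw T parameter samples θ_t (e.g., from an ensemble or via MC dropout) and average the resulting per-pixel predictions. The sketch below illustrates this for binary segmentation, with the predictive entropy as a per-pixel uncertainty map; it is a generic approximation, not the specific mechanism of any of the cited models, and all names here are our own.

```python
import numpy as np

def mc_predictive(prob_maps):
    """Monte Carlo estimate of the predictive distribution in eq. (1).

    prob_maps: array of shape (T, H, W), where prob_maps[t] holds the
    foreground probabilities p(y=1 | x, theta_t) for one parameter
    sample theta_t drawn (approximately) from p(theta | D).

    Returns the pixel-wise predictive mean, approximating p(y=1 | x),
    and the predictive entropy as an uncertainty map.
    """
    p = prob_maps.mean(axis=0)  # average over parameter samples
    eps = 1e-12                 # avoid log(0)
    entropy = -(p * np.log(p + eps) + (1.0 - p) * np.log(1.0 - p + eps))
    return p, entropy
```

Pixels where the parameter samples disagree (e.g., probabilities 0.2 and 0.8) yield a predictive mean near 0.5 and high entropy, flagging them as uncertain.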

