THAT LABEL'S GOT STYLE: HANDLING LABEL STYLE BIAS FOR UNCERTAIN IMAGE SEGMENTATION

Abstract

Segmentation uncertainty models predict a distribution over plausible segmentations for a given input, which they learn from the annotator variation in the training set. In practice, however, these annotations can differ systematically in the way they are generated, for example through the use of different labeling tools. This results in datasets that contain both data variability and differing label styles. In this paper, we demonstrate that applying state-of-the-art segmentation uncertainty models to such datasets can lead to model bias caused by the different label styles. We present an updated modelling objective that conditions on label style for aleatoric uncertainty estimation, and modify two state-of-the-art architectures for segmentation uncertainty accordingly. We show with extensive experiments that this method reduces label style bias while improving segmentation performance, increasing the applicability of segmentation uncertainty models in the wild. We curate two datasets with annotations in different label styles, which we will make publicly available along with our code upon publication.

1. INTRODUCTION

Image segmentation is a fundamental task in computer vision and biomedical image processing. As part of the effort to create safe and interpretable ML systems, the quantification of segmentation uncertainty has become a crucial task as well. While different sources, and therefore different types, of uncertainty can be distinguished (Kiureghian & Ditlevsen, 2009; Gawlikowski et al., 2021), research has mainly focused on modelling two types: aleatoric and epistemic uncertainty. Epistemic uncertainty mainly refers to model uncertainty due to missing training data, whereas aleatoric uncertainty arises through variability inherent to the data, caused for example by differing opinions of the annotators about the presence, position, and boundary of an object.

Since aleatoric uncertainty estimates are directly inferred from the training data, it is important that the variability in the available ground-truth annotations represents the experts' disagreement. In practice, however, the annotations might vary in systematic ways, caused by differing labeling tools or labeling instructions. Especially in settings where opinions in the form of annotations are sourced from experts at different institutions, datasets can be heterogeneous in the way their labels are generated. Even in the best of cases, in which manual annotators are given detailed instructions on how to segment objects, they still have to make choices on how to annotate ambiguous parts of the image. Moreover, annotators are not always carefully trained, and may not have access to the same labeling tools.

Figure 1: Sample image from the PhC-U373 dataset with annotations (red). The first three annotators were instructed to delineate the boundary in detail, whereas the last three annotators were instructed to provide a coarser and faster annotation.
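To make the conditioning objective concrete: one simple way to condition a segmentation network on label style is to append one-hot style channels to its input, so the model can learn a separate output distribution per annotation style. The sketch below is a minimal illustration under our own assumptions (channels-last NumPy arrays, a hypothetical `condition_on_label_style` helper); it is not the paper's exact architecture.

```python
import numpy as np

def condition_on_label_style(image, style_id, num_styles):
    """Concatenate one-hot label-style channels to an image.

    image: (H, W, C) float array
    style_id: integer index of the annotation style
              (e.g. 0 = detailed boundary, 1 = coarse annotation)
    num_styles: total number of label styles in the dataset
    """
    h, w, _ = image.shape
    # One constant plane per style; only the plane for style_id is set to 1.
    style_planes = np.zeros((h, w, num_styles), dtype=image.dtype)
    style_planes[..., style_id] = 1.0
    return np.concatenate([image, style_planes], axis=-1)

# Example: a 4x4 grayscale image, conditioned on style 1 of 2.
x = np.random.rand(4, 4, 1).astype(np.float32)
x_cond = condition_on_label_style(x, style_id=1, num_styles=2)
print(x_cond.shape)  # (4, 4, 3): image channel + two style channels
```

At training time the style channels are set from the known provenance of each annotation; at test time the user picks the style the prediction should be expressed in, which is what lets the model separate label style bias from genuine data variability.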

