LEARNING TO SEGMENT FROM NOISY ANNOTATIONS: A SPATIAL CORRECTION APPROACH

Abstract

Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone because annotation demands both considerable time and annotator expertise. Existing methods mostly assume that the noisy labels of different pixels are i.i.d. However, segmentation label noise usually has strong spatial correlation and a prominent bias in its distribution. In this paper, we propose a novel Markov model for noisy segmentation annotations that encodes both spatial correlation and bias. Further, to mitigate such label noise, we propose a label correction method that recovers the true labels progressively. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.

1. INTRODUCTION

Noisy annotations are inevitable in large-scale datasets and can heavily impair the performance of deep neural networks (DNNs) due to their strong memorization power (Zhang et al., 2016; Arpit et al., 2017). Image segmentation also suffers from the label noise problem. For medical images, segmentation quality is highly dependent on human annotators' expertise and time spent. In practice, medical students and residents in training are often recruited to annotate, potentially introducing errors (Gurari et al., 2015; Kohli et al., 2017). Even among experts, there can be poor consensus on objects' location and boundary (Menze et al., 2014; Joskowicz et al., 2018; Zhang et al., 2020a). Furthermore, segmentation annotations require detailed pixel/voxel-level delineations of the objects of interest. Annotating objects with complex boundaries and structures is especially time-consuming. Thus, errors are naturally introduced when annotating at scale.

Segmentation is the first step of most analysis pipelines. Inaccurate segmentation can introduce error into measurements such as morphology, which can be important for downstream diagnosis and prognostic tasks (Wang et al., 2019a; Nafe et al., 2005). Therefore, it is important to develop training methods that are robust against segmentation label noise. However, despite many existing methods addressing label noise in classification tasks (Patrini et al., 2017; Yu et al., 2019; Zhang & Sabuncu, 2018; Li et al., 2020; Liu et al., 2020; Zhang et al., 2021; Xia et al., 2021), limited progress has been made in the context of image segmentation. A few existing segmentation label noise approaches (Zhu et al., 2019; Zhang et al., 2020b; a) directly apply methods designed for classification label noise. However, these methods assume the label noise for each pixel is i.i.d. (independent and identically distributed).
This assumption is not realistic in the segmentation context, where annotation is often done with brushes, and error is usually introduced near the boundary of objects. Regions further away from the boundary are less likely to be mislabeled (see Fig. 1c for an illustration). Therefore, in segmentation tasks, the label noise of pixels has to be spatially correlated. An i.i.d. label noise results in unrealistic annotations, as in Fig. 1b.

We propose a novel label noise model for segmentation annotations. Our model simulates the real annotation scenario, where an annotator uses a brush to delineate the boundary of an object. The noisy boundary can be considered a random yet continuous distortion of the true boundary. To capture this noise behavior, we propose a Markov process model. At each step of the process, two Bernoulli variables control the expansion/shrinkage decision and the spatially dependent expansion/shrinkage strength along the boundary. This model ensures that the noisy label is a continuous distortion of the ground truth label along the boundary, as shown in Fig. 1c. Our model also includes a random flipping noise, which allows random (yet sparse) mislabels to appear even in regions far away from the boundary.

Based on our Markov label noise, we propose a novel algorithm to recover the true labels by removing the bias. Since correcting model bias without any reference is almost impossible (Massart & Nédélec, 2006), our algorithm requires a clean validation set, i.e., a set of well-curated annotations, to estimate and correct the bias introduced by label noise. We prove theoretically that only a small amount of validation data is needed to fully correct the bias and clean the noise. Empirically, we show that a single validation image annotation is enough for the bias correction; this is quite reasonable in practice.
Furthermore, we generalize our algorithm to an iterative method that repeatedly trains a segmentation model and corrects labels until convergence. Since our algorithm, called Spatial Correction (SC), is separate from the DNN training process, it is agnostic to the backbone DNN architecture and can be combined with any segmentation model. On a variety of benchmarks, our method demonstrates superior performance over different state-of-the-art (SOTA) baselines. To summarize, our contribution is threefold.

• We propose a Markov model for segmentation label noise. To the best of our knowledge, this is the first noise model that is tailored to the segmentation task and considers spatial correlation.
• We propose an algorithm to correct the Markov label noise. Although a validation set is required to combat bias, we prove that the algorithm only needs a small amount of validation data to fully recover the clean labels.
• We extend the algorithm to an iterative approach (SC) that can handle more general label noise in various benchmarks, and we show that it outperforms SOTA baselines.

2. RELATED WORK

Methods for classification label noise can be categorized into two classes, i.e., model re-calibration and data re-calibration. Model re-calibration methods focus on training a robust network using the given noisy labels. Some estimate a noise matrix through special designs of network architecture (Sukhbaatar et al., 2015; Goldberger & Ben-Reuven, 2017) or loss functions (Patrini et al., 2017; Hendrycks et al., 2018). Some design loss functions that are robust to label noise (Zhang & Sabuncu, 2018; Wang et al., 2019b; Liu & Guo, 2020; Lyu & Tsang, 2020; Ma et al., 2020). For example, generalized cross entropy (GCE) (Zhang & Sabuncu, 2018) and symmetric cross entropy (SCE) (Wang et al., 2019b) combine the robustness of mean absolute error with the classification strength of cross entropy loss. Other methods (Xia et al., 2021; Liu et al., 2020; Wei et al., 2021) add a regularization term to prevent the network from overfitting to noisy labels. Model re-calibration methods usually rely on strong assumptions and have limited performance when the noise rate is high.

Data re-calibration methods achieve SOTA performance by either selecting trustworthy data or correcting labels that are suspected to be noisy. Methods such as Co-teaching (Han et al., 2018) and others (Jiang et al., 2018; Yu et al., 2019) filter out noisy labels and train the network only on clean samples. Most recently, Tanaka et al. (2018); Zheng et al. (2020); Zhang et al. (2021) propose methods that correct noisy labels using network predictions. Li et al. (2020) extend these methods by maintaining two networks and relabeling each sample with a linear combination of the original label and the confidence of the peer network that takes augmented input.

Training Segmentation Models with Label Noise. Most existing methods adapt methods for classification to the segmentation task. Zhu et al. (2019) utilize a sample re-weighting technique to train a robust model by placing more weight on reliable samples. Zhang et al. (2020c) extend Co-teaching (Han et al., 2018) to Tri-teaching. Three networks are trained jointly, and each pair of networks alternately selects informative samples for the third network to learn from, according to the consensus and difference between their predictions. Zhang et al. (2020b)

3. METHOD

We start by introducing the main intuition behind each subsection before we discuss its details.

In Section 3.1, we model the noise due to inexact annotations. When an annotator segments an image by marking the boundary, the erroneous annotation process resembles a random distortion around the true boundary. In this sense, the noisy segmentation boundary can be obtained by randomly distorting the true segmentation boundary. We model the random distortion with a Markov process. Each Markov step is controlled by two parameters $\theta_1$ and $\theta_2$. $\theta_1$ controls the probability of expansion/shrinkage: the boundary moves towards the exterior with probability $\theta_1$ and towards the interior with probability $1 - \theta_1$. $\theta_2$ represents the probability of marching, i.e., a point on the boundary has probability $\theta_2$ to take a step and $1 - \theta_2$ to halt. Fig. 3 illustrates such a process. We start with the true label and go through two expansion steps and one shrinkage step. At each step, we mark the flipped boundary pixels. If $\theta_1 \neq 0.5$, the random distortion has a preference for expansion or shrinkage. This results in a bias in the expected state. A DNN trained with such label noise will inevitably be affected by the bias. Theoretically, this bias is challenging to correct with existing label noise methods, and in general with any bias-agnostic method (Massart & Nédélec, 2006). Therefore, we require a reasonably small validation set to remove the bias.

In Section 3.2, we propose a provably correct algorithm to correct the noisy labels by removing this bias. We start with the case $T = 1$. Since every pixel on the foreground/background boundary has the same probability of being flipped into a noisy label, the expected state, i.e. the expectation of the Markov process, has only three cases: taking one step outside, taking one step inside, or staying unchanged. This indicates that the relationship between the expected state and the true label can be linearized. We prove this using the signed distance representation in Lemma 1.
If the expected state for each image is given, then with only one corresponding true label we can recover the bias in the linear equation. However, in practice, the DNN is trained to predict the expected state, and there will be an approximation error. Therefore, more validation data may be required to get a precise estimate. In Theorem 1 we prove that with a fixed error and confidence level, the necessary validation set size is only $O(1)$.

The algorithm in Section 3.2 is designed for our Markov noise, where each point's moving probability only depends on its relative position on the boundary. In the real world, this probability can also be feature-dependent. To combat more general label noise, in Section 3.3 we extend our algorithm to correct labels iteratively based on logits. In practice, logits are the network outputs before the sigmoid. Unlike the distance function, logits contain feature information. A larger absolute logit value means a more confident prediction. Subtracting the bias from the logits moves boundary points according to features, i.e., regions with large confidence move less than regions with small confidence. After correcting the noisy labels, we retrain the DNN with the new labels and repeat this until the estimated bias is small enough. We summarize our framework in Figure 2.

Notations. We assume a 2D input image $X \in \mathbb{R}^{H \times W \times C}$ with height $H$, width $W$, and $C$ channels, although our algorithm naturally generalizes to 3D images. $Y \in \{0, 1\}^{H \times W \times L}$ is the underlying true segmentation mask with $L$ classes in a one-hot manner. When training a segmentation model with label noise, we are provided with a noisy training dataset $D = \{(X_n, \tilde{Y}_n)\}_{n=1}^{N}$. Here all noisy masks are sampled from the same distribution $P(\tilde{Y} | X)$, and $P(Y | X)$ denotes the true label distribution. We use the subscript $s \in I$ to represent a pixel index, where $I = \{(i, j) \mid 1 \le i \le H, 1 \le j \le W\}$ is the index set. The scalar forms $X_s$ and $Y_s$ denote the pixel value at index $s$ and its label.
For the rest of the paper, we assume a binary segmentation task with Foreground (FG) and Background (BG) labels. Since Y is defined in one-hot manner, our formula and algorithm can be generalized to multi-class segmentation easily.

Figure 2: Framework of our method, with two stages: Model Training and Label Correction. We train a DNN using noisy labels. The learned DNN prediction boundary (red dashed line) is corrected to the new boundary (black solid line). We use corrected labels to re-train the network. The iterative algorithm can correct the noisy predictions to true labels progressively.

3.1. MODELING LABEL NOISE AS A MARKOV PROCESS

For an input image $X$, we denote the clean and noisy masks by $Y$ and $\tilde{Y} \in \{0, 1\}^{H \times W}$. The finite Markov process is denoted by $M_\epsilon(T, \theta_1, \theta_2)$, with $T$ denoting the number of steps. $\theta_1$ and $\theta_2$ are two Bernoulli parameters denoting the annotation preference and annotation variance, respectively. To further enhance the modeling ability, we also introduce into our model random flipping noise at regions far away from the boundary. This is achieved by adding a matrix-valued random noise $\epsilon \sim \{\mathrm{Bernoulli}(\theta_3)\}^{H \times W}$ in the final step. The formal definition of $M_\epsilon(T, \theta_1, \theta_2)$ is given in Definition 1.

Denote by $F$, $B$ the foreground and background masks, i.e. $F = Y$ and $B = 1 - Y$. We define the boundary operator $\partial\cdot$ of $F$ and $B$. Let $\partial F = \mathbb{1}_{\{s \mid F_s = 1,\ \exists r \in N_s,\ B_r = 1\}}$ be the boundary mask of $F$, i.e. foreground pixels adjacent to $B$ hold value 1, and all others 0. $N_s$ is the four-neighborhood of index $s$. Similarly, let $\partial B$ be the boundary mask of $B$. Note that the sets $\{\partial F = 1\}$ and $\{\partial B = 1\}$ are on opposite sides of the boundary, as shown in Fig. 3(a). For simplicity, we abuse the notation $\partial B$ for both a matrix and the set of indices $\{\partial B = 1\}$; the meaning will be clear from context. The Markov process at the $t$-th step generates the $t$-th noisy label $\tilde{Y}^{(t)}$ based on the $(t-1)$-th noisy label $\tilde{Y}^{(t-1)}$. Denote by $F^{(t-1)}$ and $B^{(t-1)}$ the foreground and background masks of $\tilde{Y}^{(t-1)}$.

Definition 1 (Markov Label Noise). Let $\tilde{Y}^{(0)} = Y$. For $t = 0, 1, \ldots, T-1$, let $z_1^{(t)} \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\theta_1)$ and $Z_2^{(t)} \overset{\text{i.i.d.}}{\sim} \{\mathrm{Bernoulli}(\theta_2)\}^{H \times W}$, and

$$\tilde{Y}^{(t+1)} = \tilde{Y}^{(t)} + z_1^{(t)} Z_2^{(t)} \odot \partial B^{(t)} + (z_1^{(t)} - 1) Z_2^{(t)} \odot \partial F^{(t)}.$$

The final output noisy label is $\tilde{Y} = \tilde{Y}^{(T)} + \epsilon \odot \mathrm{Sign}$, where $\mathrm{Sign} = B^{(T)} \odot B - F^{(T)} \odot F$. The process is denoted by $M_\epsilon(T, \theta_1, \theta_2)$.

The random variable $z_1^{(t)}$ determines whether the noise is obtained by expanding or shrinking the boundary in step $t$. In the case of expansion, $z_1^{(t)} = 1$, the error is generated by flipping the background boundary pixels $\partial B^{(t)}$ with probability $\theta_2$. This is encoded in the term $Z_2^{(t)} \odot \partial B^{(t)}$, where $\partial B^{(t)}$ is an indicator matrix that restricts the flipping to background boundary pixels. Similarly, in the case of shrinkage, $z_1^{(t)} = 0$, a pixel in $\partial F^{(t)}$ has probability $\theta_2$ of being flipped into background. Fig. 3 shows an example with 3 steps, corresponding to expansion, expansion and shrinkage.
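As an illustration of Definition 1, the following is a minimal sketch of a sampler for the Markov label noise on a small binary mask. It is not the authors' implementation: the function names `boundary` and `markov_label_noise` are hypothetical, masks are plain nested lists to keep the example dependency-free, and $Z_2^{(t)}$ is sampled only on boundary pixels, which is distributionally equivalent because the other entries are masked out by $\partial B^{(t)}$ or $\partial F^{(t)}$.

```python
import random


def boundary(mask, value):
    """Return the mask of pixels equal to `value` that have a four-neighbor
    of the other class (value=0 gives dB, value=1 gives dF)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] != value:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                r, c = i + di, j + dj
                if 0 <= r < h and 0 <= c < w and mask[r][c] != value:
                    out[i][j] = 1
                    break
    return out


def markov_label_noise(y, T, theta1, theta2, theta3=0.0, rng=None):
    """Sample a noisy mask from M_eps(T, theta1, theta2) as in Definition 1."""
    if rng is None:
        rng = random.Random()
    h, w = len(y), len(y[0])
    noisy = [row[:] for row in y]
    for _ in range(T):
        expand = rng.random() < theta1                # z1 ~ Bernoulli(theta1)
        band = boundary(noisy, 0 if expand else 1)    # dB^(t) or dF^(t)
        for i in range(h):
            for j in range(w):
                # Z2 ~ Bernoulli(theta2), only relevant on the boundary band
                if band[i][j] and rng.random() < theta2:
                    noisy[i][j] = 1 if expand else 0
    # Random flipping eps * Sign: Sign is non-zero only where the step-T mask
    # still agrees with the true mask (+1 on background, -1 on foreground).
    for i in range(h):
        for j in range(w):
            if noisy[i][j] == y[i][j] and rng.random() < theta3:
                noisy[i][j] = 1 - noisy[i][j]
    return noisy
```

With $\theta_1 = \theta_2 = 1$ and $\theta_3 = 0$ the process is deterministic and reduces to a four-neighbor dilation per step, which gives a convenient sanity check.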

3.2. LABEL CORRECTION BY REMOVING BIAS

Suppose the Markov noise has underlying posterior $P(\tilde{Y} | X)$, for which a general explicit form is not available due to the progressive spatial dependency. However, we can study the Bayes classifier $c(X) = \arg\max_{\tilde{Y}} P(\tilde{Y} | X)$. For a fixed image $X$, we take $\tilde{Y} \sim P(\tilde{Y} | X)$ as a random variable. Then $\mathbb{E}[\tilde{Y}]$ and $c(X)$ are related by

$$c(X) = [\mathbb{E}[\tilde{Y}]]_{\ge 0.5},$$

where $[\mathbb{E}[\tilde{Y}]]_{\ge 0.5}$ means that for each index $s$, $[\mathbb{E}[\tilde{Y}]]_{\ge 0.5}(s) = 1$ if $\mathbb{E}[\tilde{Y}_s] \ge 0.5$, and 0 otherwise. Note that $\mathbb{E}[\tilde{Y}]$ is a probability map while $[\mathbb{E}[\tilde{Y}]]_{\ge 0.5}$ is a binary mask. Taking the expectation of the process in Definition 1 gives

$$\mathbb{E}[\tilde{Y}] = \mathbb{E}[Y] + \theta_1 \theta_2 \sum_{t=0}^{T-1} \mathbb{E}[\partial B^{(t)}] + (\theta_1 - 1)\theta_2 \sum_{t=0}^{T-1} \mathbb{E}[\partial F^{(t)}] + \theta_3 \mathbb{E}[\mathrm{Sign}]. \qquad (3)$$

Equation 3 defines the bias between $\mathbb{E}[Y]$ and $\mathbb{E}[\tilde{Y}]$. But the ambiguous representation of $\mathbb{E}[\partial B^{(t)}]$ makes it difficult to simplify the bias. In order to analyze it, we need to transform the explicit boundary representation $\partial B$ into an implicit function.

For a domain $\Omega$, we define its interface (boundary) as $\partial\Omega$, and its interior and exterior as $\Omega^-$ and $\Omega^+$, respectively. An implicit interface representation defines the interface as the isocontour of some function $\phi(\cdot)$. Typically, the interface is defined as $\partial\Omega = \{s \in \Omega \mid \phi(s) = 0\}$. A straightforward implicit function is the signed distance function. The distance function is defined by

$$d(s) = \begin{cases} \min_{t \in \partial B} d(s, t) + 1, & \text{if } s \in B, \\ \min_{t \in \partial F} d(s, t) + 1, & \text{otherwise.} \end{cases}$$

The signed distance function $\phi(s)$ is then defined by $\phi(s) = d(s)$ if $s \in B$, and $\phi(s) = -d(s)$ otherwise.

We start from the case $T = 1$. With the signed distance representation, we have the following lemma.

Lemma 1. Let $\phi$ be the signed distance function defined for $[\mathbb{E}[Y]]_{\ge 0.5}$. If $\theta_3 \ll 0.5$, then

$$\tilde{\phi} = \begin{cases} \phi - [\theta_1 \theta_2]_{\ge 0.5} - 1, & \text{if } \theta_1 \ge 0.5, \\ \phi + [1 + \theta_1 \theta_2 - \theta_2]_{< 0.5} + 1, & \text{otherwise.} \end{cases}$$

Here $\tilde{\phi}$ is the signed distance function of $c(X) = [\mathbb{E}[\tilde{Y}]]_{\ge 0.5}$ when $T = 1$.

With the signed distance function, we can explicitly define the bias by

$$\Delta = \frac{1}{|I|} \sum_{s \in I} (\tilde{\phi}_s - \phi_s),$$

where $|I|$ is the cardinality of the index set, i.e. the image size. If $c(X)$ is perfectly learned, the difference between $\tilde{\phi}$ and $\phi$, i.e. $\Delta$, can be estimated with a small clean validation set, even if it contains only one image. The original function $\phi$ can then be recovered on the training set by $\phi = \tilde{\phi} - \Delta$, and the corresponding clean mask is simply $[\phi]_{\le 0}$.

However, in practice, the learned classifier, denoted by $\hat{c}(X)$, is different from the noisy Bayes optimal $c(X)$. This introduces an error into $\hat{\phi}$, the signed distance function of $\hat{c}(X)$, for which more validation data may be required. The naive algorithm is:

1. Bias estimation. Train a DNN with noisy labels. Given a clean validation set $\{x_v, y_v\}_{v=1}^{V}$, the bias is estimated by

$$\hat{\Delta} = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|I|} \sum_{s \in I} (\hat{\phi}^v_s - \phi^v_s). \qquad (6)$$

2. Label correction. Correct the training labels by $c'(X) = [\phi']_{\le 0}$, where $\phi' = \hat{\phi} - \hat{\Delta}$. Then retrain the network using the corrected labels.

In Theorem 1 we prove that with a small validation size, the above algorithm can recover the true label with bounded error.

Theorem 1. If $\exists\, \varepsilon_0 > 0, \varepsilon_1 > 0$ such that $\mathbb{E}_X \sup_s |\hat{\phi}(X_s) - \tilde{\phi}(X_s)| \le \varepsilon_0$ and $\sup_X \sup_s |\hat{\phi}(X_s) - \tilde{\phi}(X_s)| \le \varepsilon_1$ hold for the learned classifier $\hat{c}(X)$, then $\forall \varepsilon > \varepsilon_0$ and for a fixed confidence level $0 < \alpha \le 1$, with

$$V \ge \frac{\varepsilon_1^2}{2(\varepsilon - \varepsilon_0)^2} \log\frac{2|I|}{\alpha} \qquad (7)$$

clean samples $\{x_v, y_v\}_{v=1}^{V}$, $\phi'$ recovers $\phi$ within $\varepsilon + \varepsilon_0$ error with probability at least $1 - \alpha$, i.e. $P\left(\mathbb{E}_X[\sup_s |\phi'(X_s) - \phi(X_s)|] \le \varepsilon + \varepsilon_0\right) \ge 1 - \alpha$.

Proofs of Lemma 1 and Theorem 1 are provided in the Appendix. The theorem shows that for a fixed error level $\varepsilon$ and a fixed confidence level $\alpha$, the required validation size $V$ does not depend on the training set size and grows only logarithmically with the image size $|I|$. Note that $\varepsilon_0$ is the mean model error over images and $\varepsilon_1$ is the supremum model error over images, so $\varepsilon_0 \le \varepsilon_1$ holds naturally. $\varepsilon_1$ constrains the variation of the model prediction across images: if $\varepsilon_0 = \varepsilon_1$, the model prediction quality is similar over images. Even in the worst case, where the model cannot predict any label correctly, $\varepsilon_1$ is still bounded by the image size $|I|$.
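To get a feel for the bound in Equation 7, the sketch below evaluates its right-hand side and rounds up to an integer number of validation images. The function name `min_validation_size` and the numeric values in the usage line are illustrative assumptions, not settings taken from the paper.

```python
import math


def min_validation_size(eps, eps0, eps1, alpha, num_pixels):
    """Smallest integer V satisfying Equation 7:
    V >= eps1^2 / (2 (eps - eps0)^2) * log(2 |I| / alpha)."""
    if not eps > eps0 >= 0:
        raise ValueError("Theorem 1 requires eps > eps0 >= 0")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    bound = eps1 ** 2 / (2.0 * (eps - eps0) ** 2) * math.log(2.0 * num_pixels / alpha)
    return math.ceil(bound)


# Hypothetical setting: a 256 x 256 image, errors measured in pixels of
# signed distance, and a 95% confidence level.
v_needed = min_validation_size(eps=2.0, eps0=1.0, eps1=1.0, alpha=0.05,
                               num_pixels=256 * 256)
```

Because the image size only enters through the logarithm, quadrupling the number of pixels changes the bound very little, whereas the ratio $\varepsilon_1 / (\varepsilon - \varepsilon_0)$ enters quadratically.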

3.3. ITERATIVE LABEL CORRECTION

In this section, we extend our algorithm to more general label noise. In our Markov model, all points on the boundary move with the same probability along the normal direction of a signed distance function. The boundary in real-world noise, however, could have a feature-dependent moving probability. We capture this by using a logit function representation. For an image $X$, its logit function $f(X)$ is defined by $\mathbb{E}[Y] = \sigma \circ f(X)$, where $\sigma(\cdot)$ is the sigmoid function and $Y$ is the segmentation label. Note that $f(X)$ is positive in the interior $\Omega^-$ and negative in $\Omega^+$, so the implicit function is defined by $-f(X)$. The bias estimation step remains the same. But at the label correction step, we apply the bias to $f(X)$ instead of $\hat{\phi}$. In practice, $f(X)$ is the model output before the sigmoid function. The label correction step using the logit function is

$$f'(X) = f(X) + \lambda \exp\left(-\frac{\hat{\phi}^2}{2(\gamma \hat{\Delta})^2}\right). \qquad (8)$$

Here $\lambda$ is the bias adapted to the logit function, and the exponential term is a decay function that concentrates the correction around $\hat{\phi} = 0$. The decay factor is $\gamma \hat{\Delta}$, where $0 < \gamma \le 1$ is a hyper-parameter. The bias $\lambda$ is defined by $\lambda = \inf f|_{0 \le \hat{\phi}_s \le \hat{\Delta}}$ if $\hat{\Delta} > 0$, and $\lambda = \sup f|_{\hat{\Delta} \le \hat{\phi}_s \le 0}$ if $\hat{\Delta} < 0$, where $f|_{\Omega}$ is the function $f$ restricted to the domain $\Omega$. Although the bias decays at the same scale over the distance transform, the gradient of the logit function varies along the normal direction. Therefore, Equation 8 actually moves points on the interface by different amounts: points with a smaller absolute gradient move more.

Since the algorithm corrects noisy labels that are spatially correlated with the boundary, we refer to it as Spatial Correction (SC). We present SC as an iterative approach in Algorithm 1. In practice, the hyper-parameter $\gamma$ is usually set to 1, and the algorithm can terminate after only 1 iteration.
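The per-pixel update of Equation 8 can be sketched as follows. This is a literal transcription of the formula as printed, under the assumption that the signed distance map, the estimated bias and the logit-domain bias $\lambda$ have already been computed; the names `correct_logits` and `relabel` are hypothetical, and maps are nested lists so that the example has no dependencies.

```python
import math


def correct_logits(f, phi_hat, delta_hat, lam, gamma=1.0):
    """Apply f'(s) = f(s) + lam * exp(-phi_hat(s)^2 / (2 (gamma * delta_hat)^2))
    to every pixel. `lam` is passed in rather than derived here."""
    if delta_hat == 0:
        # Algorithm 1 only corrects labels while |delta_hat| >= 1.
        raise ValueError("delta_hat must be non-zero")
    scale = 2.0 * (gamma * delta_hat) ** 2
    return [
        [f_s + lam * math.exp(-(p_s ** 2) / scale) for f_s, p_s in zip(f_row, p_row)]
        for f_row, p_row in zip(f, phi_hat)
    ]


def relabel(f_prime):
    """Threshold corrected logits at zero, i.e. [f'(X)]_{>=0}."""
    return [[1 if v >= 0 else 0 for v in row] for row in f_prime]
```

The Gaussian weight equals 1 where the signed distance is 0 and falls off quickly away from the predicted boundary, so logits deep inside the foreground or background are left almost unchanged.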

Algorithm 1 Spatial Correction

Input: Noisy training dataset D, a small clean validation dataset V, and a hyper-parameter γ.
Output: A robust DNN f trained with the denoised dataset.
Train a DNN f with D.
Predict labels on V using f, and estimate $\hat{\Delta}$ by Equation 6.
while $|\hat{\Delta}| \ge 1$ do
    Replace training labels with $[f'(X)]_{\ge 0}$, where $f'(X)$ is calculated by Equation 8.
    Re-train f with corrected labels.
    Predict labels on V using the new f, and estimate $\hat{\Delta}$ by Equation 6.
end while
return f.
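The control flow of Algorithm 1 can be sketched as a small driver function. The three callables (`train`, `estimate_bias`, `correct_labels`) are hypothetical placeholders for model training, Equation 6 and Equation 8, and `max_iters` is a safety cap added for this sketch that is not part of Algorithm 1.

```python
def spatial_correction(train, estimate_bias, correct_labels, noisy_labels, max_iters=10):
    """Iterate train -> estimate bias -> correct labels until |bias| < 1.

    train(labels) -> model
    estimate_bias(model) -> float         (Equation 6 on the clean validation set)
    correct_labels(model, labels, bias)   (Equation 8 followed by thresholding)
    """
    labels = noisy_labels
    model = train(labels)
    delta = estimate_bias(model)
    iters = 0
    while abs(delta) >= 1 and iters < max_iters:
        labels = correct_labels(model, labels, delta)
        model = train(labels)
        delta = estimate_bias(model)
        iters += 1
    return model, labels, iters
```

Keeping training, bias estimation and correction behind separate callables mirrors the statement that SC is agnostic to the backbone: any segmentation model can be plugged in as `train`.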

4. EXPERIMENTS

In this section, we provide an extensive evaluation of our approach with multiple datasets and noise settings. We also show in ablation studies, that our method is robust to high noise level and the required clean validation set size can be extremely small.

4.1. DATASETS AND IMPLEMENTATION DETAILS

Synthetic noise settings. We use three public medical image datasets: the JSRT dataset (Shiraishi et al., 2000), ISIC 2017 and Brats 2020. For each of these three datasets, we use three noise settings, denoted by $S_E$, $S_S$ and $S_M$. $S_E$ and $S_S$ are two settings synthesized by our Markov process with $\theta_1 > 0.5$ (expansion) and $\theta_1 < 0.5$ (shrinkage), respectively. Figure 4 shows examples of our synthesized label noise. We also include the mix of random dilation and erosion noise $S_M$ used by previous work (Zhu et al., 2019; Zhang et al., 2020b; a). This is achieved by randomly dilating or eroding a mask by a number of pixels. Note that our Markov label noise can theoretically include this type of noise by setting $\theta_1 = 0.5$. Detailed parameters for these settings are provided in the Appendix. We use a simple U-Net as our backbone network for the JSRT and ISIC 2017 datasets and a 3D-UNet for the Brats 2020 dataset. The hyper-parameter $\gamma$ is set to 1 and the total number of iterations is 1.

From top to bottom, the rows are sample slices with noisy masks, with true masks, and with corrected masks, respectively.

Real-world label noise. Evaluating with real-world label noise is challenging. We are not aware of any public medical image segmentation dataset that has both true labels and noisy labels from human annotators. Therefore, we use a multi-annotator dataset, the LIDC-IDRI dataset (Armato III et al., 2015; Armato et al., 2011; Clark et al., 2013), and the coarse segmentation in a vision dataset, Cityscapes (Cordts et al., 2016). The LIDC-IDRI dataset consists of 1018 3D thorax CT scans where four radiologists have annotated multiple lung nodules in each scan. The dataset was annotated by 12 radiologists, and it is not possible to match an annotation to an expert. We use the majority vote as the true labels and the union of the four annotations as the noisy labels. We process and split the data in exactly the same way as Kohl et al. (2018).
The Cityscapes dataset contains 5000 finely annotated images along with a coarse segmentation by human annotators that we use as the "noisy label". We only focus on the 'car' class because (1) cars are popular objects and are frequently included in images; (2) the coarse annotation of cars is very similar to noisy annotation in medical imaging: they are reasonable distortions of the clean label without changing the topology. See Figure 4c for an example. The detailed settings of LIDC-IDRI and Cityscapes can be found in Appendix A.2.1.

Baselines. We compare the proposed SC with SOTA learning-with-label-noise methods from both classification contexts (GCE (Zhang & Sabuncu, 2018), SCE (Wang et al., 2019b), CT+ (Yu et al., 2019), ELR (Liu et al., 2020), CDR (Xia et al., 2021)) and segmentation contexts (QAM (Zhu et al., 2019), CLE (Zhang et al., 2020b)). Technical details of these baseline methods are provided in Appendix A.2.2. Our method requires a small clean validation set whereas most baselines do not. For a fair comparison, we also use the clean validation set to strengthen the baselines. In particular, we pretrain the baselines using the clean validation dataset, and then add the validation images and their clean labels to the training set. Note that our method (SC) is only trained on the original noisy training set; it only uses the validation set to estimate the bias.
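The construction of clean and noisy labels for LIDC-IDRI described above (majority vote versus union of the four annotations) can be sketched as below. The helper names are hypothetical, and the tie rule is an assumption of this sketch: by default a pixel needs a strict majority (3 of 4 votes) to count as foreground, since the text does not state how 2-2 ties are resolved.

```python
def majority_vote(masks, min_votes=None):
    """Pixel-wise majority vote over binary annotator masks.
    min_votes defaults to a strict majority (assumption: ties go to background)."""
    n = len(masks)
    if min_votes is None:
        min_votes = n // 2 + 1
    h, w = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[i][j] for m in masks) >= min_votes else 0 for j in range(w)]
        for i in range(h)
    ]


def union(masks):
    """Pixel-wise union: foreground if any annotator marked the pixel."""
    h, w = len(masks[0]), len(masks[0][0])
    return [
        [1 if any(m[i][j] for m in masks) else 0 for j in range(w)]
        for i in range(h)
    ]
```

Since the union always contains the majority-vote mask, this construction yields expansion-type noise, the biased regime that the bias correction step targets.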

4.2. RESULTS

Table 1 shows the segmentation results of different methods with synthetic noisy label settings on the JSRT, ISIC 2017 and Brats 2020 datasets. Note that QAM cannot be applied to the Brats 2020 dataset because its network is designed for 2D only. We compare the Dice score (DSC) on the testing sets (against the clean labels). For each setting, we train 5 different models and report the mean DSC and standard deviation. In $S_E$ and $S_S$, where biases show up in the noisy labels, the proposed method outperforms the baselines by a large margin in all cases. The compared methods, however, only work when little bias is included, as in $S_M$. $S_M$ is equivalent to setting $\theta_1 = 0.5$ in our Markov model, resulting in $\Delta = 0$. We also test the proposed method on real-world label noise, with results shown in Table 2. Figure 5 shows examples of label correction results. We provide more qualitative results in Appendix A.4.
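For reference, the Dice score used as the evaluation metric can be computed for binary masks as in the following minimal sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details given in the paper.

```python
def dice_score(pred, target):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for two binary masks of equal shape."""
    inter = sum(p & t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total
```

Because the score is computed against the clean test labels, a model that reproduces a systematic dilation or erosion learned from biased training labels is penalized even when its boundary is smooth.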

4.3. ABLATION STUDY

Increasing Noise Level. Our proposed method is robust even when the noise level is high. In Figure 6a, we compare the proposed SC with SCE and ELR. We increase the synthetic noise level on the JSRT dataset by increasing the number of steps $T$ in our Markov model, while keeping the same $\theta_1$ and $\theta_2$ (details and illustrations are in the supplementary material). Results show our method is still robust even under an extreme noise level, while the performance of GCE and ELR drops rapidly as the noise level increases.

Decreasing Validation Size. In this experiment, we show that SC works well even with an extremely small clean validation set. Following setting $S_M$ of the JSRT dataset, we shrink the validation set from 24 to 18, 12, 6 and 1 images, respectively. We compare the performance with SCE and ELR. Results in Figure 6b show that our proposed method still works well when the validation size is extremely small, even if only one clean sample is provided.

5. CONCLUSION

In this paper, we proposed a Markov process to model segmentation label noise. Targeting such label noise model, we proposed a label correction method to recover true labels progressively. We provide theoretical guarantees of the correctness of a conceptual algorithm and relax it into a more practical algorithm, called SC. Our experiments show significant improvements over existing approaches on both synthetic and real-world label noise.

A APPENDIX

A.1 PROOFS

Lemma 1. Let $\phi$ be the signed distance function defined for the domain $[\mathbb{E}[Y]]_{\ge 0.5}$. If $\theta_3 \ll 0.5$, then

$$\tilde{\phi} = \begin{cases} \phi - [\theta_1 \theta_2]_{\ge 0.5} - 1, & \text{if } \theta_1 \ge 0.5, \\ \phi + [1 + \theta_1 \theta_2 - \theta_2]_{< 0.5} + 1, & \text{otherwise.} \end{cases} \qquad (A.1)$$

Here $\tilde{\phi}$ is the signed distance function of $[\mathbb{E}[\tilde{Y}]]_{\ge 0.5}$ when $T = 1$.

Proof. When $T = 1$ and the given $Y$ is deterministic, we have

$$\mathbb{E}[\tilde{Y}] = Y + \theta_1 \theta_2 \mathbb{E}[\partial B] + (\theta_1 - 1)\theta_2 \mathbb{E}[\partial F] + \theta_3 \mathbb{E}[\mathrm{Sign}]. \qquad (A.2)$$

Recall that $\mathrm{Sign} = B^{(T)} \odot B - F^{(T)} \odot F$, indicating $\mathrm{Sign}_s = 0$ if $s \in \partial F \cup \partial B$. Therefore,

$$\mathbb{E}[\tilde{Y}_s] = \begin{cases} \theta_1 \theta_2, & s \in \partial B, \\ 1 + \theta_1 \theta_2 - \theta_2, & s \in \partial F, \\ Y_s \pm \theta_3, & \text{otherwise.} \end{cases} \qquad (A.3)$$

Since $Y_s$ is binary, if $\theta_3 \ll 0.5$ then $[Y_s \pm \theta_3]_{\ge 0.5} = Y_s$. Therefore,

$$c(X_s) = \begin{cases} [\theta_1 \theta_2]_{\ge 0.5}, & s \in \partial B, \\ [1 + \theta_1 \theta_2 - \theta_2]_{\ge 0.5}, & s \in \partial F, \\ Y_s, & \text{otherwise.} \end{cases} \qquad (A.4)$$

We claim the following three facts.

(1) $\theta_1 \theta_2 \ge 0.5$ only if $\theta_1 \ge 0.5$. This is true because if $\theta_1 < 0.5$, then $\theta_1 \theta_2 < 0.5$ since $\theta_2 \le 1$.
(2) $1 + \theta_1 \theta_2 - \theta_2 < 0.5$ only if $\theta_1 < 0.5$. If $\theta_1 \ge 0.5$, then $1 + (\theta_1 - 1)\theta_2 \ge 1 - 0.5\theta_2 \ge 0.5$.
(3) $\theta_1 \theta_2 \ge 0.5$ and $1 + \theta_1 \theta_2 - \theta_2 < 0.5$ are mutually exclusive. First, when $\theta_1 \theta_2 \ge 0.5$, we have $1 + \theta_1 \theta_2 - \theta_2 \ge 1.5 - \theta_2 \ge 0.5$. Conversely, when $1 + \theta_1 \theta_2 - \theta_2 < 0.5$, we have $\theta_1 \theta_2 < \theta_2 - 0.5 \le 0.5$.

The first two facts separate the two cases in Equation A.1. According to fact (3), either $\partial B$ is flipped into foreground or $\partial F$ is flipped into background. In the former case, expansion happens because $\theta_1 > 0.5$, and $\partial B$ becomes $\partial \tilde{F}$, leading to $\tilde{\phi} = \phi - [\theta_1 \theta_2]_{\ge 0.5} - 1$. The opposite happens in the latter case. Therefore,

$$\tilde{\phi} = \begin{cases} \phi - [\theta_1 \theta_2]_{\ge 0.5} - 1, & \text{if } \theta_1 \ge 0.5, \\ \phi + [1 + \theta_1 \theta_2 - \theta_2]_{< 0.5} + 1, & \text{otherwise.} \end{cases} \qquad (A.5)$$

This completes the proof.

Theorem 1. If $\exists\, \varepsilon_0, \varepsilon_1 > 0$ such that

$$\mathbb{E}_X \sup_s |\hat{\phi}(X_s) - \tilde{\phi}(X_s)| \le \varepsilon_0, \qquad (A.6)$$

and

$$\sup_X \sup_s |\hat{\phi}(X_s) - \tilde{\phi}(X_s)| \le \varepsilon_1, \qquad (A.7)$$

hold for the learned classifier $\hat{c}(X)$, then $\forall \varepsilon > \varepsilon_0$ and for a fixed confidence level $0 < \alpha \le 1$, with

$$V \ge \frac{\varepsilon_1^2}{2(\varepsilon - \varepsilon_0)^2} \log\frac{2|I|}{\alpha} \qquad (A.8)$$

clean samples $\{x_v, y_v\}_{v=1}^{V}$, $\phi'$ recovers $\phi$ within $\varepsilon + \varepsilon_0$ error with probability at least $1 - \alpha$, i.e. $P\left(\mathbb{E}_X[\sup_s |\phi'(X_s) - \phi(X_s)|] \le \varepsilon + \varepsilon_0\right) \ge 1 - \alpha$.

and if

$$V \ge \frac{\varepsilon_1^2}{2(\varepsilon - \varepsilon_0)^2} \log\frac{2|I|}{\alpha}, \qquad (A.20)$$

then

$$P(|\hat{\Delta} - \Delta| \ge \varepsilon) \le \alpha. \qquad (A.21)$$

Next we bound the label correction error. Note that $\phi = \tilde{\phi} - \Delta$, and our label correction step gives $\phi' = \hat{\phi} - \hat{\Delta}$. Then we have

$$\mathbb{E}_X \sup_s |\phi'(X_s) - \phi(X_s)| = \mathbb{E}_X \sup_s |(\hat{\phi}(X_s) - \tilde{\phi}(X_s)) + (\Delta - \hat{\Delta})| \le \mathbb{E}_X \sup_s |\hat{\phi}(X_s) - \tilde{\phi}(X_s)| + |\Delta - \hat{\Delta}| \le \varepsilon_0 + |\Delta - \hat{\Delta}|. \qquad (A.22)$$

Therefore,

$$P(|\hat{\Delta} - \Delta| \ge \varepsilon) = P(|\hat{\Delta} - \Delta| + \varepsilon_0 \ge \varepsilon + \varepsilon_0) \ge P\left(\mathbb{E}_X \sup_s |\phi'(X_s) - \phi(X_s)| \ge \varepsilon + \varepsilon_0\right). \qquad (A.23)$$

According to A.21, we have proved

$$P\left(\mathbb{E}_X \sup_s |\phi'(X_s) - \phi(X_s)| \ge \varepsilon + \varepsilon_0\right) \le \alpha.$$

250, 250, 170), (300, 300, 220), (350, 350, 270), (400, 400, 320), respectively. $\theta_3$ is set to 0.1 for all settings. To improve the smoothness, we also use a Gaussian filter before the random flipping. An example for increasing noise level of heart class is shown in

A.2.2 BASELINES

We compare the proposed SC with current SOTA methods from both the classification context and the segmentation context:

(1) GCE (Zhang & Sabuncu, 2018) trains deep neural networks with a generalized cross entropy loss to handle noisy labels. The hyperparameters k, q in this work are set to 0.5 and 0.8, respectively.
(2) SCE (Wang et al., 2019b) combines the cross entropy and reverse cross entropy (RCE) into a single noise-robust loss. The hyperparameters α and β are set to 1.0 and 0.5 for the JSRT and ISIC 2017 datasets, and α = 1.0, β = 0.5 for the Cityscapes dataset.
(3) CT+ (Yu et al., 2019) utilizes two networks and selects small-loss instances for cross training. To apply this method to segmentation, we treat each pixel as an instance. The prior noise rate τ is estimated with the clean validation set.
(4 Then train a robust network with corrected labels.

A.3 ILLUSTRATIVE EXPERIMENTS

Network predictions can be overconfident if bias shows up in the training labels. Some works (Zhang et al., 2020b; Li et al., 2021) choose to trust the network prediction probability to correct label noise, i.e. they assume that predictions with small confidence are likely to be wrong while pixels with large confidence tend to be correctly predicted. However, the network can fit to noisy labels quickly and be overconfident when trained with biased noisy labels. This phenomenon is also observed by Zhang et al. (2016). In Figure A.3(a), we train a network with dilated noise and show its prediction probability map, i.e. the output after the sigmoid. The red pixel is predicted as foreground with a high probability of 0.9994874, whereas it is actually in the background. Therefore, methods based on trusting the network predictions cannot correct this label because the network is overconfident. And since the network can fit to noise rapidly, early learning techniques also cannot eliminate this bias. Our method works because we do not trust the network prediction. Instead, we compare it to the clean labels in the validation set and estimate the bias. We then eliminate this bias in the training predictions. We also show how the prediction probability changes while training the model in Figure A.3(b). It shows that the model can become overconfident quickly during training. Hence methods that employ early learning techniques (Liu et al., 2020; Arpit et al., 2017; Liu et al., 2022) struggle under biased noisy labels.



Code is available at https://github.com/michaelofsbu/SpatialCorrection.



Figure 1: (a) Original image with the true segmentation boundary (blue dashed line). (b) The classification label noise model is unrealistic in the segmentation context: the label noise (small squares) spreads all over the mask. (c) Realistic segmentation noise generated by our noise model. The noise mostly consists of distortions of the boundary, with a few random flips in the interior/exterior.

Figure 3: Illustration of a 3-step Markov process. (a) The true label mask, where red pixels are foreground and blue pixels are background. We mark background boundary pixels and foreground boundary pixels as $\partial B$ and $\partial F$, respectively. (b) An expansion step. Pixels marked as '1' have been flipped into foreground. The flipped pixels are randomly chosen with probability $\theta_2$ from the $\partial B$ pixels in (a). (c) Another expansion step by flipping pixels marked as '2' to foreground. (d) A shrinkage step. Pixels marked as '3' were foreground in (c) but are flipped into background in (d).
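The expansion and shrinkage steps in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions: masks are nested lists of 0/1, boundaries use the four-neighborhood, and shrinkage flips $\partial F$ pixels with the same probability $\theta_2$ as expansion (the caption states this probability only for expansion). The helper names `boundary_pixels` and `markov_step` are hypothetical, and how the direction of each step is chosen is left to the caller.

```python
import random

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # four-neighborhood offsets


def boundary_pixels(mask, label):
    """Pixels equal to `label` with at least one 4-neighbor of the other class."""
    h, w = len(mask), len(mask[0])
    found = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] != label:
                continue
            for di, dj in NEIGHBORS:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and mask[ni][nj] != label:
                    found.append((i, j))
                    break
    return found


def markov_step(mask, expand, theta2, rng):
    """One step: expansion flips background boundary pixels to foreground,
    shrinkage flips foreground boundary pixels to background.
    Each boundary pixel is flipped independently with probability theta2."""
    source, target = (0, 1) if expand else (1, 0)
    new_mask = [row[:] for row in mask]
    for i, j in boundary_pixels(mask, source):
        if rng.random() < theta2:
            new_mask[i][j] = target
    return new_mask


# Three steps as in Figure 3: expand, expand, shrink.
rng = random.Random(0)
mask = [[0] * 7 for _ in range(7)]
mask[3][3] = 1
for expand in (True, True, False):
    mask = markov_step(mask, expand, theta2=0.5, rng=rng)
```

Because each step only touches pixels adjacent to the current boundary, the resulting noise stays spatially correlated instead of being scattered over the mask.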

Following Osher & Fedkiw (2003), a distance function $d(s)$ is defined as $d(s) = \min_{t \in \partial\Omega} |s - t|$, implying that $d(s) = 0$ for $s \in \partial\Omega$. A signed distance function is an implicit function $\phi$ defined by $|\phi(s)| = d(s)$ for all $s \in \Omega$, such that $\phi(s) = d(s) = 0$ for $s \in \partial\Omega$, $\phi(s) = d(s)$ for $s \in \Omega^+$, and $\phi(s) = -d(s)$ for $s \in \Omega^-$. For a domain of grid points, we define the distance function $d(s)$ by the shortest path length from $s$ to $\partial\Omega$. Note, however, that $\partial\Omega$ does not exist among image indices; it lies between $\partial B$ and $\partial F$. The corresponding definition for image indices is as follows.

Definition 2 (Signed Distance Function). For an image $X$ with index set $I$, a graph $G(S, E)$ is constructed with $S = I$ and $E = \{s \to t : s \in I, t \in N_s\}$, where $N_s$ is the four-neighborhood of index $s$. The distance $d(s, t)$ is then defined as the length of the shortest path between $s$ and $t$ in graph $G$.
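A grid-based signed distance of this kind can be computed with a multi-source breadth-first search over the four-neighbor graph. The sketch below makes two assumptions that are not stated in the text above: foreground pixels receive a positive value equal to the shortest path length to the nearest background pixel, and background pixels receive the negative of the shortest path length to the nearest foreground pixel. The function names are hypothetical.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # four-neighborhood offsets


def grid_distance_to(mask, target):
    """Shortest 4-neighbor path length from every pixel to the nearest
    pixel whose value equals `target` (multi-source BFS)."""
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for i in range(h):
        for j in range(w):
            if mask[i][j] == target:
                dist[i][j] = 0
                queue.append((i, j))
    if not queue:
        raise ValueError("mask must contain both foreground and background")
    while queue:
        i, j = queue.popleft()
        for di, dj in NEIGHBORS:
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                queue.append((ni, nj))
    return dist


def signed_distance(mask):
    """Assumed sign convention: positive inside the foreground, negative outside.
    Boundary pixels get +1 or -1, so the zero level lies between them."""
    to_background = grid_distance_to(mask, 0)
    to_foreground = grid_distance_to(mask, 1)
    return [
        [to_background[i][j] if mask[i][j] == 1 else -to_foreground[i][j]
         for j in range(len(mask[0]))]
        for i in range(len(mask))
    ]


square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance(square)
```

The nearest background pixel of a foreground pixel is always a background boundary pixel, so the search over all background pixels gives the same value as a search restricted to the background boundary.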

(a) JSRT SE. (b) ISIC 2017 SS. (c) Cityscapes.

Figure 4: Examples of synthetic and real-world noise. In each image, the blue line is the true segmentation boundary, and all other colors are the corresponding noisy boundaries. We removed the random flipping noise in the visualization to focus on the boundary.

Figure 5: Label correction results for the Brats 2020 dataset with the $S_E$ (left) and $S_S$ (right) settings. From top to bottom, the rows are sample slices with noisy masks, with true masks, and with corrected masks, respectively.

Figure 6: Ablation study.

Fig. A.1.

Figure A.1: Increasing noise levels in the 'heart' class. The red line is the noisy boundary and the blue line is the true boundary.

Figure A.2: Comparison of equal dilation (a) and equal erosion (c) with the Markov noise (b) and (d) generated by our Markov model.

(4) ELR (Liu et al., 2020) incorporates the early stopping technique into a regularization term to prevent the network from memorizing noisy labels. The hyperparameters $\lambda$ and $\beta$ are set to 7 and 0.8, respectively. (5) QAM (Zhu et al., 2019) reweights samples with a quality awareness module (QAM), trained together with the segmentation model. The outputs of QAM are the weights for each image in the loss function. (6) CLE (Zhang et al., 2020b) leverages confident learning to correct noisy labels based on the network prediction.
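For reference, the two loss-based baselines, GCE and SCE, can be sketched per pixel as below. This is an illustrative sketch rather than the baselines' official code: the truncation threshold $k$ of GCE is omitted, and the constant `log_zero` that stands in for $\log 0$ in the reverse cross entropy is a hypothetical default chosen for this example. The defaults for `q`, `alpha`, and `beta` follow the values listed in the baseline descriptions.

```python
import math


def gce_loss(p_label, q=0.8):
    """Generalized cross entropy (1 - p^q) / q, where p is the probability
    assigned to the (possibly noisy) labeled class. Truncation is omitted."""
    return (1.0 - p_label ** q) / q


def sce_loss(probs, label, alpha=1.0, beta=0.5, log_zero=-4.0):
    """Symmetric cross entropy alpha * CE + beta * RCE for one pixel.

    RCE is -sum_k p_k * log(y_k) with a one-hot label y, where log(0) is
    replaced by the constant `log_zero` (an assumed value for this sketch).
    """
    ce = -math.log(max(probs[label], 1e-7))  # clamp to avoid log(0)
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce


# A pixel predicted as foreground with probability 0.9 and labeled foreground.
example_gce = gce_loss(0.9)
example_sce = sce_loss([0.1, 0.9], label=1)
```

As `q` approaches 0, the GCE loss approaches the standard cross entropy, while `q = 1` gives a loss that is linear in the predicted probability.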

The probability of the red point while training.

Figure A.3: A model trained with biased labels can be overconfident. The model is trained with dilated noise. In the left figure, the blue line is the true segmentation boundary, and the red point is a selected pixel that is background in the true label but is predicted as foreground. The right figure shows how the network prediction at the red point fits the noise during training.

Noisy annotations with random inner holes. Our Markov noise model assumes that the spatial bias occurs around the true boundary, but it does not exclude other random noise such as inner holes. Figure A.4 shows an expanded noise with low-frequency inner holes (first row), and our method can correct this hole after 2 iterations.

Figure A.4: A model trained on noisy labels with random inner holes (first row), and the labels corrected by the proposed method after 1 iteration (second row) and 2 iterations (third row).

Figure A.6: Qualitative results of different model predictions, part one. Blue lines are the true boundaries. All other colors are the corresponding prediction boundaries.

Mean DSC (in percent) and standard deviation for five models trained on three noisy settings. The method with the best mean DSC is highlighted for each noise setting.

Mean DSC (in percent) and standard deviation for five models trained on the Cityscapes and LIDC-IDRI datasets. The method with the best mean DSC is highlighted.

Synthetic Noise Setting. The synthetic noise follows the parameters in Table A.1. $S_E$ stands for the expansion setting and $S_S$ for the shrinkage setting. For the ablation study of noise level, we keep the same $\theta_1$, $\theta_2$ in S

Table A.1: Label noise settings on two synthetic datasets. $M(T, \theta_1, \theta_2)$ stands for the proposed multi-step Markov process.

Dataset Settings. For the Brats 2020 dataset, we merge all three classes into a single class. The reason is that some classes have such small volumes that most of them could vanish when we create the

ACKNOWLEDGMENTS

The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study. This research of Jiachen Yao and Chao Chen was partly supported by NSF CCF-2144901. The reported research of Prateek Prasanna was partly supported by NIH 1R21CA258493-01A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Mayank Goswami would like to acknowledge support from US National Science Foundation (NSF) grant CCF-1910873. 


Proof. To bound the label correction error, we first bound the bias estimation error. We aim to prove that

$$P(|\hat{\Delta} - \Delta| \ge \varepsilon) \le \alpha. \tag{A.9}$$

Recall that the definition of $\hat{\Delta}$ is

and $\Delta$ is the true bias, which is the same for every image $X$. Hence,

Substituting Equation A.10 and Equation A.11 into the LHS of Inequality A.9,

(A.12)

Therefore, proving A.9 is equivalent to proving

The error is bounded by the image size, i.e., $0 \le |\hat{\varphi}^v_s - \varphi^v_s| \le \varepsilon_1$. According to Hoeffding's inequality, a lower bound is provided for the following probability,

By the given model error A.6, $\forall s \in I$,

Combining Inequalities A.14 and A.15,

Observing the LHS of A.13 and A.16, the difference is that the random variable in A.13 is the average error among a specific group of indices, while A.16 holds for an arbitrary index. If we iterate A.16 over the index set, then A.13 naturally holds for a group of indices. In other words,

To obtain A.13, let

A.5 EXTENDED NOTATION

Although our notation is for 2D images, our method naturally generalizes to 3D. For a 3D input $X \in \mathbb{R}^{H \times W \times D \times C}$, defining the spatial correlation only requires redefining the neighborhood $N_s$ as the six-neighborhood, i.e., the four neighbors in the same slice plus the pixels directly above and below in the adjacent slices. In general, our spatial correlation can be defined for images of arbitrary dimension.
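The generalization of the neighborhood to arbitrary dimension can be sketched in a few lines. The function name `axis_neighbors` is hypothetical; it returns the indices that differ from a given index by one step along exactly one axis, which gives the four-neighborhood in 2D and the six-neighborhood in 3D.

```python
def axis_neighbors(index, shape):
    """Indices that differ from `index` by +/-1 along exactly one axis
    and stay inside a grid of the given `shape`."""
    neighbors = []
    for axis in range(len(shape)):
        for step in (-1, 1):
            coord = index[axis] + step
            if 0 <= coord < shape[axis]:
                neighbors.append(index[:axis] + (coord,) + index[axis + 1:])
    return neighbors


interior_2d = axis_neighbors((1, 1), (3, 3))        # four neighbors
interior_3d = axis_neighbors((1, 1, 1), (3, 3, 3))  # six neighbors
corner_3d = axis_neighbors((0, 0, 0), (3, 3, 3))    # three neighbors at a corner
```

Only the spatial axes are passed in `shape`; the channel dimension $C$ plays no role in the neighborhood.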

