DIFFUSION ADVERSARIAL REPRESENTATION LEARNING FOR SELF-SUPERVISED VESSEL SEGMENTATION

Abstract

Vessel segmentation in medical images is an important task in the diagnosis of vascular diseases and in therapy planning. Although learning-based segmentation approaches have been extensively studied, supervised methods require a large amount of ground-truth labels, and confusing background structures make it hard for neural networks to segment vessels in an unsupervised manner. To address this, we introduce a novel diffusion adversarial representation learning (DARL) model that leverages a denoising diffusion probabilistic model with adversarial learning, and apply it to vessel segmentation. In particular, for self-supervised vessel segmentation, DARL learns the background signal using a diffusion module, which lets a generation module effectively provide vessel representations. Also, through adversarial learning based on the proposed switchable spatially-adaptive denormalization, our model estimates synthetic fake vessel images as well as vessel segmentation masks, which further helps the model capture vessel-relevant semantic information. Once trained, the proposed model generates segmentation masks in a single step and can be applied to general vascular structure segmentation of coronary angiography and retinal images. Experimental results on various datasets show that our method significantly outperforms existing unsupervised and self-supervised vessel segmentation methods.

We build our model on X-ray coronary angiography using the XCAD dataset (Ma et al., 2021) and apply it to several different blood vessel datasets, including retinal images. Experimental results show that our method outperforms several baseline methods by large margins for vessel segmentation tasks in the absence of labeled data. The main contributions are summarized as:
1. We propose a diffusion adversarial representation model, a non-iterative version of the diffusion model for image generation, and apply it to self-supervised vessel segmentation.

1. INTRODUCTION

In the clinical diagnosis of vascular diseases, vessel segmentation is necessary for analyzing vessel structures and planning therapy. In particular, when diagnosing coronary artery disease, X-ray angiography is taken to enhance vessel visualization by injecting a contrast agent into the blood vessels (Cong et al., 2015). However, it is challenging to extract vessels accurately due to low contrast, motion artifacts, many tiny branches, structural interference in the backgrounds, etc. (Xia et al., 2019; Chen et al., 2014).

To segment vascular structures, various segmentation methods have been explored. Traditional optimization models (Law & Chung, 2009; Taghizadeh Dehkordi et al., 2014) typically require complicated preprocessing steps and manual tuning. Furthermore, they are computationally expensive when processing many images. On the other hand, learning-based methods (Nasr-Esfahani et al., 2016; Fan et al., 2018; Chen et al., 2019) generate segmentation maps in real time once the models are trained. However, supervised methods require a huge amount of labeled data for training, which complicates their use in practical applications. Also, existing unsupervised methods designed for natural images are difficult to apply to medical vessel images due to low-contrast subtle branches and confusing background structures. Although a recent self-supervised method (Ma et al., 2021) learns vessel representations, it requires two different adversarial networks to segment vessels, which increases training complexity.

Recently, diffusion models such as the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020) have become one of the main research topics in modeling data distributions and sampling diverse images.
By learning the Markov transformation of the reverse diffusion process from Gaussian noise to data, DDPM has been successfully applied to many low-level computer vision tasks such as super-resolution (Chung et al., 2022), inpainting (Lugmayr et al., 2022), and colorization (Song et al., 2020). For high-level vision tasks, while a recent study (Baranchuk et al., 2021) shows that DDPM can capture semantic information and be used as an image representation, methods applying DDPM to learn semantic segmentation without labeled data have so far not been developed. Also, the sampling process of diffusion models often takes a relatively long time.

In this paper, we introduce a novel concept of diffusion adversarial representation learning (DARL), which is a non-iterative version of the diffusion-based generative model and can be successfully applied to self-supervised vessel segmentation without ground-truth labels. As illustrated in Figure 1, our model is composed of a diffusion module and a generation module, which learns semantic information of vessels via adversarial learning. Specifically, based on the observation that the diffusion model estimates the noise added to the perturbed input data, and that the adversarial learning model generates images from given noise vectors, we can naturally connect the diffusion model with the adversarial model. This allows our model not only to generate images in real time but also to segment vessels with robustness to noise and across various modalities. Here, inspired by the spatially-adaptive denormalization (SPADE) layer (Park et al., 2019), which is effective in image synthesis given semantic masks, we present a switchable version of SPADE in the generation module to estimate vessel segmentation maps and mask-based fake angiograms simultaneously. This can yield a synergy effect in learning vessel representations by extracting proper features for angiogram synthesis.
More specifically, as shown in Figure 1, given unpaired background images and angiography images that are taken before and after injection of the contrast agent, there are two paths for feeding the inputs into our proposed model: (A) when the real angiography images are given, our model without the SPADE estimates vessel segmentation maps; (B) when the background images are given, our model with the SPADE generates synthetic angiograms that composite vessel-like semantic masks with the input backgrounds. Also, as each vessel-like semantic mask in the (B) path can be regarded as the pseudo-label for the generated angiography image, by feeding the synthetic angiograms into the (A) path again, we apply cycle consistency between the segmentation maps and the fractal mask labels to capture semantic information of vessels. In addition, by designing the diffusion module to intensively learn the background signal, we let the module treat vessel structures of angiography images as outliers when estimating the latent feature. Thereby, vessel structures represented in the output of the diffusion module can guide the generation module to effectively segment the vessels. Specifically, the latent features of our diffusion module provide vessel information and thus improve the segmentation performance.
2. Through the proposed generation module with switchable SPADE layers, our model not only generates synthetic angiography images but also segments vessel structures.
3. Experimental results verify that our model achieves superior segmentation performance by learning vessel representations. In particular, although the model is trained using X-ray coronary angiograms, it also achieves state-of-the-art performance for un-/self-supervised retinal vessel segmentation, confirming the generalization capability of the model.

2. BACKGROUNDS AND RELATED WORKS

Denoising diffusion probabilistic model. The diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) is a generative model that samples realistic data by learning the distribution of real images. In particular, the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020) with score matching has shown superior performance in image generation. Specifically, DDPM learns a Markov chain that converts the Gaussian noise distribution $x_T \sim \mathcal{N}(0, I)$ into the target distribution $x_0$. In the forward diffusion process, noise is gradually added to the data by:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right),$$

where $\beta_t \in [0, 1]$ is a fixed variance. Accordingly, the distribution of a noisy target $x_t$ given the data $x_0$ is represented as:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\alpha_t}\, x_0, (1-\alpha_t) I\right), \tag{2}$$

where $\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$. Then, DDPM is trained to approximate the reverse diffusion process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\right),$$

where $\sigma_t^2$ is a fixed variance, and $\mu_\theta$ is a mean parameterized by the noise predictor $\epsilon_\theta$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\, \epsilon_\theta(x_t, t) \right).$$

Thus, in the generative process, a sample can be obtained from Gaussian noise by the iterative denoising steps $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$, where $z \sim \mathcal{N}(0, I)$.

Through this stochastic process, DDPM provides diverse realistic samples and has been exploited in many applications, including super-resolution (Chung et al., 2022; Saharia et al., 2021), inpainting (Lugmayr et al., 2022), and colorization (Song et al., 2020; Saharia et al., 2022). However, its application to semantic segmentation remains limited. Although several works (Baranchuk et al., 2021; Amit et al., 2021) have recently been presented to solve high-level vision problems, they require annotated data to train the models.
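To make the forward and reverse formulas above concrete, the following is a minimal pure-Python sketch operating on flat lists of pixel values. The function names (`linear_betas`, `cumulative_alphas`, `forward_sample`, `reverse_mean`) and the linear variance schedule are illustrative assumptions, not the authors' implementation; in practice a trained network would supply the noise prediction.

```python
import math
import random


def linear_betas(T, beta_start, beta_end):
    """Linearly spaced variances beta_1..beta_T (list index 0 holds beta_1)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """alpha_t = prod_{s=1..t} (1 - beta_s) for t = 1..T."""
    alphas, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alphas.append(running)
    return alphas


def forward_sample(x0, t, alphas, rng):
    """Draw x_t ~ q(x_t | x_0) for a 1-indexed step t; returns (x_t, eps)."""
    a = alphas[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def reverse_mean(xt, eps_pred, t, betas, alphas):
    """mu_theta(x_t, t) computed from a noise prediction eps_pred."""
    b, a = betas[t - 1], alphas[t - 1]
    scale = b / math.sqrt(1.0 - a)
    return [(x - scale * e) / math.sqrt(1.0 - b) for x, e in zip(xt, eps_pred)]


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_betas(10, 1e-4, 2e-2)      # hypothetical toy schedule
    alphas = cumulative_alphas(betas)
    x0 = [0.5, -1.0, 2.0]
    xt, eps = forward_sample(x0, 1, alphas, rng)
    # With the true noise, the reverse mean at t = 1 recovers x0 up to rounding.
    print(reverse_mean(xt, eps, 1, betas, alphas))
```

A useful sanity check on the formulas: at $t = 1$, where $\alpha_1 = 1 - \beta_1$, substituting the true noise for $\epsilon_\theta$ makes $\mu_\theta$ equal to $x_0$.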
Self-supervised vessel segmentation. For the vessel segmentation task, it is difficult to obtain fine-grained labels for supervised learning, since vessels have complex structures with numerous tiny branches. While this label scarcity issue can be alleviated by semi- or unsupervised learning, fully unsupervised methods that segment tiny vessels with reasonable performance are relatively scarce. In fact, recent unsupervised learning methods trained with natural images have great generalization capability on unseen datasets (Ahn et al., 2021; Chen et al., 2019; Melas-Kyriazi et al., 2022), and thus they can be easily adapted to medical image segmentation tasks. However, due to the unique characteristics of angiography, e.g. confusing background factors and sophisticated vessel structures, unsupervised methods designed for natural image segmentation may show degraded performance when applied to vessel segmentation of noisy angiography images.

As a type of unsupervised learning, self-supervised learning has also been introduced to utilize supervisory labels generated from the data themselves to efficiently learn target representations in various medical image segmentation tasks, and it has demonstrated its potential (Mahmood et al., 2019; Ma et al., 2021; Oh & Ye, 2021). Specifically, Ma et al. (2021) introduce an end-to-end adversarial learning framework for vessel segmentation with the CycleGAN (Zhu et al., 2017) structure, which learns realistic angiogram generation by adding fractal-guided pseudo labels to the background images. However, the simple arithmetic operation used for synthetic vessel generation often fails to yield realistic pseudo-vessel images, so adversarial networks trained on unrealistic synthetic images struggle to reach optimal segmentation performance.

3. DIFFUSION ADVERSARIAL REPRESENTATION LEARNING

In this section, we describe our novel diffusion adversarial representation learning (DARL) model, tailored for self-supervised vessel segmentation. We refer to the images before and after injecting the contrast agent into the blood vessels as background and angiography, respectively. Note that due to the different scanning times, these two images have different contrasts and are not aligned because of patient movement. Thus, as shown in Figure 1, our DARL model is trained on unpaired angiography images $x^a_0$ and background images $x^b_0$. Specifically, our model is composed of a diffusion module $\epsilon_\theta$ to estimate latent features, a generation module $G$ to estimate both the vessel segmentation masks $\hat{s}^v$ and the synthetic angiograms $\hat{x}^a$, and two discriminators $(D_s, D_a)$ to distinguish real and fake images of the vessel masks and the angiograms, respectively. Here, in generating angiography images, we provide the vessel-like fractal masks $s^f$ presented by Ma et al. (2021) to the generation module to perform image synthesis based on semantic layouts. Moreover, to estimate the segmentation maps and angiography images effectively, we design the generation module with novel switchable SPADE layers, where the SPADE (Park et al., 2019) facilitates the semantic image synthesis.

Generation module with switchable SPADE layers. As illustrated in Figure 1, the proposed generation module consists of $N$ residual blocks (ResnetBlock) that have switchable SPADE (S-SPADE) layers. Note that the (A) and (B) paths of our model are implemented simultaneously by sharing the learnable parameters except for the S-SPADE layers. Then, let $v \in \mathbb{R}^{M \times C \times H \times W}$ be the feature map in the ResnetBlock, where $M$, $C$, $H$, and $W$ are the batch size, number of channels, height, and width, respectively.
The switchable SPADE layer normalizes feature maps differently depending on the existence of an input mask $s$:

$$v = \begin{cases} \mathrm{SPADE}(v, s), & \text{if mask } s \text{ is given}, \\ \mathrm{IN}(v), & \text{otherwise}, \end{cases}$$

where IN is the instance normalization (Ulyanov et al., 2017). So, when our model is given the fake vessel mask $s^f$, the SPADE is computed by:

$$x_{m,c,h,w} = \gamma_{c,h,w}(s^f)\, \frac{x_{m,c,h,w} - \mu_c}{\sigma_c} + \beta_{c,h,w}(s^f),$$

where $x_{m,c,h,w}$ denotes the $(m, c, h, w)$-th element of the feature tensor $v$, $(\mu_c, \sigma_c)$ are the mean and standard deviation of the feature map in channel $c$, and $(\gamma_{c,h,w}, \beta_{c,h,w})$ are modulation parameters learned during training. Thus, in the (A) path, given the noisy angiogram $x^a_t$ and the latent feature $\epsilon_\theta(x^a_t, t)$, the generation module $G$ estimates the vessel segmentation masks $\hat{s}^v$ without the SPADE:

$$\hat{s}^v = G(\epsilon_\theta(x^a_t, t); 0).$$

On the other hand, in the (B) path that provides the fractal mask $s^f$, the generation module taking the noisy background $x^b_t$ and its latent feature $\epsilon_\theta(x^b_t, t)$ synthesizes the fake angiograms $\hat{x}^a$:

$$\hat{x}^a = G(\epsilon_\theta(x^b_t, t); s^f).$$
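The switching rule can be sketched in a few lines of pure Python for a single channel stored as a nested list. This is an illustration under simplifying assumptions: `gamma_fn` and `beta_fn` are hypothetical pointwise stand-ins for the learned modulation maps (which in SPADE are produced by convolutions over the mask), and the small constant `eps` is added for numerical stability.

```python
import math


def instance_norm(v, eps=1e-5):
    """Normalize one channel (H x W nested list) to zero mean and unit variance."""
    flat = [x for row in v for x in row]
    mu = sum(flat) / len(flat)
    var = sum((x - mu) ** 2 for x in flat) / len(flat)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[(x - mu) * inv_std for x in row] for row in v]


def switchable_spade(v, mask=None, gamma_fn=None, beta_fn=None):
    """Apply plain instance norm without a mask, mask-modulated norm with one.

    gamma_fn and beta_fn map a mask value to a scale and a shift. They are
    hypothetical placeholders for the learned, spatially varying parameters.
    """
    normed = instance_norm(v)
    if mask is None:
        return normed                      # (A) path: no mask, IN only
    if gamma_fn is None or beta_fn is None:
        raise ValueError("gamma_fn and beta_fn are required when a mask is given")
    return [                               # (B) path: per-pixel modulation
        [gamma_fn(m) * x + beta_fn(m) for x, m in zip(row_v, row_m)]
        for row_v, row_m in zip(normed, mask)
    ]


if __name__ == "__main__":
    feature = [[1.0, 2.0], [3.0, 4.0]]
    fractal_mask = [[0, 1], [0, 1]]
    print(switchable_spade(feature))
    print(switchable_spade(feature, fractal_mask,
                           gamma_fn=lambda m: 1.0 + m,
                           beta_fn=lambda m: 0.5 * m))
```

Sharing everything except this layer is what lets one generator serve both paths: the same normalized features are either passed through unchanged or rescaled according to the mask.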

3.1. NETWORK TRAINING

In contrast to the DDPM that pretrains the diffusion model, our method trains the diffusion module, generation module, and discriminators simultaneously using an adversarial learning framework. Figure 2 depicts the detailed training flow of our model. There are two distinct paths: (A) one feeds the real angiograms $x^a_0$ into the model to provide vessel masks $\hat{s}^v$, and (B) the other takes the real backgrounds $x^b_0$ and the fractal masks $s^f$ for the model to generate fake angiograms $\hat{x}^a$. Here, as shown in Figure 2 (B), since the input fractal masks can be regarded as vessel segmentation labels of the fake angiograms, we forward the fake angiograms generated through the (B) path to the (A) path, and apply cycle consistency between the estimated segmentation masks and the fractal masks to capture the vessel information.

3.1.1. LOSS FUNCTION

To train the model, we employ the LSGAN (Mao et al., 2017) framework, which leads to the alternating application of the following two optimization problems:

$$\min_{\theta, G} \mathcal{L}_G(\epsilon_\theta, G, D_s, D_a), \qquad \min_{D_s, D_a} \mathcal{L}_D(\epsilon_\theta, G, D_s, D_a),$$

where $\mathcal{L}_G$ and $\mathcal{L}_D$ denote the losses for the diffusion/generator and the discriminators, respectively, which are given by:

$$\mathcal{L}_G(\epsilon_\theta, G, D_s, D_a) = \mathcal{L}_{diff}(\epsilon_\theta) + \alpha \mathcal{L}^G_{adv}(\epsilon_\theta, G, D_s, D_a) + \beta \mathcal{L}_{cyc}(\epsilon_\theta, G),$$

$$\mathcal{L}_D(\epsilon_\theta, G, D_s, D_a) = \mathcal{L}^{D_s}_{adv}(\epsilon_\theta, G, D_s) + \mathcal{L}^{D_a}_{adv}(\epsilon_\theta, G, D_a),$$

where $\alpha$ and $\beta$ are hyperparameters, $\mathcal{L}_{diff}$ is the diffusion loss, $\mathcal{L}_{adv}$ is the adversarial loss, and $\mathcal{L}_{cyc}$ is the cyclic reconstruction loss. The detailed description of each loss function is as follows.

Diffusion loss. Recall that the diffusion module learns the distribution of images to estimate meaningful latent features of the inputs. We follow the standard loss for DDPM training (Ho et al., 2020):

$$\mathcal{L}_{diff}(\epsilon_\theta) := \mathbb{E}_{t, x_0, \epsilon} \left\| \epsilon - \epsilon_\theta\left(\sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \epsilon, t\right) \right\|^2,$$

where $\epsilon \sim \mathcal{N}(0, I)$. In particular, to let the diffusion module represent the vessels of angiograms effectively, we define the diffusion loss on the background images, i.e. $x_0 = x^b_0$ in the (B) path, and set the sampling schedule to $t \in [0, T]$. Accordingly, the diffusion module is trained intensively to learn the background image signal, allowing the module in the (A) path to regard the vessel structures of the angiograms as outliers and represent vessels in the latent features.

Adversarial loss. To generate both vessel segmentation masks and synthetic angiograms without ground-truth labels, the proposed model is trained by adversarial learning using the two discriminators $D_s$ and $D_a$. As shown in Figure 1, the discriminator $D_s$ attempts to distinguish the estimated segmentation masks $\hat{s}^v$ from the real fractal masks $s^f$ (in the (A) path), while the discriminator $D_a$ tries to discriminate between the generated angiograms $\hat{x}^a$ and the real angiography images $x^a_0$ (in the (B) path).
As we employ LSGAN (Mao et al., 2017), the adversarial loss of the generator $\mathcal{L}^G_{adv}$ can be formulated as:

$$\mathcal{L}^G_{adv}(\epsilon_\theta, G, D_s, D_a) = \mathbb{E}_{x^a}\left[(D_s(G(\epsilon_\theta(x^a); 0)) - 1)^2\right] + \mathbb{E}_{x^b, s^f}\left[(D_a(G(\epsilon_\theta(x^b); s^f)) - 1)^2\right].$$

On the other hand, the discriminators are trained to compete against the generator with the adversarial loss functions $\mathcal{L}^{D_s}_{adv}$ and $\mathcal{L}^{D_a}_{adv}$, which are defined by:

$$\mathcal{L}^{D_s}_{adv}(\epsilon_\theta, G, D_s) = \frac{1}{2}\mathbb{E}_{s^f}\left[(D_s(s^f) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{x^a}\left[(D_s(G(\epsilon_\theta(x^a); 0)))^2\right],$$

$$\mathcal{L}^{D_a}_{adv}(\epsilon_\theta, G, D_a) = \frac{1}{2}\mathbb{E}_{x^a_0}\left[(D_a(x^a_0) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{x^b, s^f}\left[(D_a(G(\epsilon_\theta(x^b); s^f)))^2\right].$$

This adversarial loss enables the single generator $G$ to fool the discriminators $D_s$ and $D_a$ by generating realistic segmentation masks $\hat{s}^v = G(\epsilon_\theta(x^a); 0)$ and angiograms $\hat{x}^a = G(\epsilon_\theta(x^b); s^f)$. In contrast, the discriminators attempt to classify these generated images as fake and the real images $s^f$ and $x^a_0$ as real.

Figure 3: Vessel segmentation according to the noise level $t_a$. Our model estimates the segmentation masks $\hat{s}^v$ using the latent features $\epsilon_\theta$ for the noisy angiograms $x^a_{t_a}$. $s^v$ is the ground-truth label.

Cyclic reconstruction loss. For the generator $G$ to capture the semantic information of the vessels, we also constrain our model with the cyclic reconstruction loss on the fractal masks. Specifically, as the vessel-like fractal masks $s^f$ can serve as labels for the synthetic angiograms $\hat{x}^a$ generated in the (B) path, we feed $\hat{x}^a$ into our model and reconstruct the fractal masks through the (A) path. Therefore, the cyclic reconstruction loss is computed between the reconstructed segmentation masks and the real fractal masks, which can be written as:

$$\mathcal{L}_{cyc}(\epsilon_\theta, G) = \mathbb{E}_{x^b, s^f}\left[\left\| G(\epsilon_\theta(G(\epsilon_\theta(x^b); s^f)); 0) - s^f \right\|_1\right].$$

Here, we treat the segmentation problem as vessel mask image generation, which is why we use the L1 loss in the cyclic loss.
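As a rough numerical illustration of how these terms combine, the sketch below evaluates the least-squares adversarial losses, the L1 cycle loss, and the weighted generator objective on plain lists of numbers. The helper names are hypothetical, expectations are replaced by batch averages, and the L1 norm is averaged over pixels (an assumption that only rescales the weight of the cyclic term).

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def lsgan_generator_loss(d_on_fake_masks, d_on_fake_angiograms):
    """Generator term: push both discriminators' scores on fakes toward 1."""
    return (_mean((d - 1.0) ** 2 for d in d_on_fake_masks)
            + _mean((d - 1.0) ** 2 for d in d_on_fake_angiograms))


def lsgan_discriminator_loss(d_on_real, d_on_fake):
    """Discriminator term: scores on real samples toward 1, on fakes toward 0."""
    return (0.5 * _mean((d - 1.0) ** 2 for d in d_on_real)
            + 0.5 * _mean(d ** 2 for d in d_on_fake))


def l1_cycle_loss(reconstructed_mask, fractal_mask):
    """Mean absolute difference between the reconstructed and input masks."""
    return _mean(abs(r - s) for r, s in zip(reconstructed_mask, fractal_mask))


def total_generator_loss(l_diff, l_adv, l_cyc, alpha=0.2, beta=5.0):
    """L_G = L_diff + alpha * L_adv + beta * L_cyc with the paper's weights."""
    return l_diff + alpha * l_adv + beta * l_cyc


if __name__ == "__main__":
    adv = lsgan_generator_loss([0.8, 0.6], [0.5, 0.7])
    cyc = l1_cycle_loss([0.0, 0.9, 0.8], [0.0, 1.0, 1.0])
    print(total_generator_loss(l_diff=0.05, l_adv=adv, l_cyc=cyc))
```

In an alternating scheme, the discriminator losses would be minimized with the generator outputs held fixed, and the generator objective would then be minimized with the discriminators held fixed.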

3.1.2. IMAGE PERTURBATION FOR THE MODEL INPUT

Given real images $x^a_0$ and $x^b_0$, our diffusion module takes noisy angiograms $x^a_{t_a}$ in the (A) path and noisy background images $x^b_{t_b}$ in the (B) path as the input, in which each noisy image is sampled based on the forward diffusion process (2):

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \epsilon, \tag{17}$$

where $\epsilon \sim \mathcal{N}(0, I)$, and $t_a$ and $t_b$ are uniformly sampled time steps in $[0, T]$. Here, so that the diffusion module not only learns the background image signal in the (B) path but also provides useful information for the generation module to segment vessel structures even from moderately noisy angiograms in the (A) path, we restrict $t_a$ to the range $[0, T_a]$ where $T_a < T$. Empirically, we found that this makes our model learn vessel representations robust to noise.
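The asymmetric time-step sampling can be sketched as follows. This is a minimal illustration assuming integer time steps drawn uniformly with both endpoints included; the function name `sample_time_steps` is hypothetical.

```python
import random


def sample_time_steps(T, T_a, rng):
    """Draw (t_a, t_b): t_a uniform on [0, T_a], t_b uniform on [0, T]."""
    if not 0 <= T_a < T:
        raise ValueError("expected 0 <= T_a < T")
    t_a = rng.randint(0, T_a)   # angiogram path: restricted noise range
    t_b = rng.randint(0, T)     # background path: full noise range
    return t_a, t_b


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_time_steps(T=2000, T_a=200, rng=rng))
```

Each drawn step would then be used in the perturbation formula above, so angiograms are only ever lightly perturbed while backgrounds span the whole noise range.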

3.2. INFERENCE OF VESSEL SEGMENTATION

The inference phase of DARL differs from that of the conventional diffusion model in that our model does not require an iterative reverse process, similar to the recent diffusion-based unsupervised learning method called DiffuseMorph (Kim et al., 2021). Specifically, once the proposed DARL is trained, we can obtain the vessel segmentation masks of angiograms from the (A) path in one step at inference. For the noisy angiograms $x^a_{t_a}$ given by the forward diffusion process (17), our model provides the vessel segmentation masks using the latent features $\epsilon_\theta(x^a_{t_a}, t_a)$ estimated from the diffusion module. As shown in Figure 3, our model can generate the segmentation masks for any noise level $t_a$ within a certain range (i.e. $[0, T_a]$). Nevertheless, since the angiography image x a

Datasets. To realize the self-supervised learning framework, we train our model with the publicly available unlabeled X-ray coronary angiography disease (XCAD) dataset obtained during stent placement surgery and with generated synthetic fractal masks (Ma et al., 2021). When training the network, the angiography and background data are sampled independently. Also, in testing, we utilize the external 134 XCA (Cervantes-Sanchez et al., 2019) and 30 XCA (Hao et al., 2020) datasets. Furthermore, we evaluate cross-organ generalization capability on the retinal imaging datasets DRIVE (Staal et al., 2004) and STARE (Hoover & Goldbaum, 2003). Details of the datasets are in Appendix B.

Implementation details. Our model is implemented by employing the network architectures proposed in DDPM (Ho et al., 2020) and SPADE (Park et al., 2019) for the diffusion module and the generation module, respectively. Also, for the discriminators, we use the PatchGAN network (Isola et al., 2017). To train the model, we set the number of time steps to $T = 2000$ with linearly scheduled noise levels from $10^{-6}$ to $10^{-2}$. Within this range, we sample the noisy angiograms by setting $T_a$ to 200.
Also, we set the hyperparameters of the loss function to $\alpha = 0.2$ and $\beta = 5$. Our model is optimized using the Adam algorithm (Kingma & Ba, 2014) with a learning rate of $5 \times 10^{-6}$ on a single Nvidia Quadro RTX 6000 GPU. We train the model for 150 epochs, and the model from the epoch with the best performance on the validation set is used for the test data. All implementations use the PyTorch library (Paszke et al., 2019) in Python. The details of the network structures and hyperparameter settings can be found in the Appendix.
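To get a feel for how much signal survives at the stated noise levels, the following sketch evaluates the cumulative product $\alpha_t$ under the reported settings. It assumes that "linearly scheduled noise levels" means the variances $\beta_t$ increase linearly from $10^{-6}$ to $10^{-2}$ over $T = 2000$ steps; the function name `alpha_bar_at` is hypothetical.

```python
import math


def alpha_bar_at(t, T=2000, beta_start=1e-6, beta_end=1e-2):
    """alpha_t = prod_{s=1..t} (1 - beta_s) for a linear beta schedule.

    The linear interpretation of the schedule is an assumption made here.
    """
    if not 0 <= t <= T:
        raise ValueError("expected 0 <= t <= T")
    step = (beta_end - beta_start) / (T - 1)
    alpha = 1.0
    for s in range(1, t + 1):
        alpha *= 1.0 - (beta_start + (s - 1) * step)
    return alpha


if __name__ == "__main__":
    for t in (0, 200, 2000):
        a = alpha_bar_at(t)
        # sqrt(a) scales the clean image, sqrt(1 - a) scales the added noise.
        print(t, math.sqrt(a), math.sqrt(1.0 - a))
```

Under this assumption, $\alpha_t$ at $t = T_a = 200$ stays at roughly 0.9, so the angiograms fed to the (A) path remain dominated by the image, whereas at $t = T$ the input is almost pure noise.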

Baseline methods and metrics

We compare our model to several un-/self-supervised baseline methods, which do not require ground-truth vessel labels. For unsupervised learning methods, we utilize Spatial-Guided Clustering (SGC) (Ahn et al., 2021), Redrawing (Chen et al., 2019), and Deep Spectral (DS) (Melas-Kyriazi et al., 2022). For self-supervised learning methods, we employ Self-supervised Transformer with Energy-based Graph Optimization (STEGO) (Hamilton et al., 2022), Deep Adversarial (DA) (Mahmood et al., 2019), and Self-Supervised Vessel Segmentation (SSVS) (Ma et al., 2021). All these methods are implemented under training conditions identical to those of our model, unless the method needs no training procedure. For baseline methods that require heuristic thresholds, optimal performance is achieved by selecting data-specific thresholds within the range from 0.2 to 0.8 in increments of 0.1. To quantitatively evaluate the segmentation performance, we compute Intersection over Union (IoU), Dice similarity coefficient, and Precision.
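For reference, the three metrics and the threshold sweep can be written out as in the sketch below, using flat lists of binary labels. The function names and the choice of Dice as the selection criterion in `best_threshold` are illustrative assumptions; the conventions for empty masks are also a choice made here.

```python
def binarize(prob_map, threshold):
    """Turn per-pixel scores into a binary mask."""
    return [1 if p >= threshold else 0 for p in prob_map]


def segmentation_metrics(pred, target):
    """IoU, Dice, and Precision for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    return {
        "iou": tp / union if union else 1.0,
        "dice": 2 * tp / (2 * tp + fp + fn) if union else 1.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Candidate thresholds 0.2, 0.3, ..., 0.8 as described in the text.
THRESHOLDS = [round(0.2 + 0.1 * i, 1) for i in range(7)]


def best_threshold(prob_map, target, key="dice"):
    """Pick the candidate threshold that maximizes the chosen metric."""
    return max(
        THRESHOLDS,
        key=lambda th: segmentation_metrics(binarize(prob_map, th), target)[key],
    )


if __name__ == "__main__":
    scores = [0.9, 0.55, 0.1, 0.25]
    labels = [1, 1, 0, 0]
    th = best_threshold(scores, labels)
    print(th, segmentation_metrics(binarize(scores, th), labels))
```

Precision is the metric most sensitive to false positives from background structures, since a prediction that marks background as vessel lowers it directly.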

4.1. EXPERIMENTAL RESULTS

Figure 4 shows the vessel segmentation masks from the baseline methods and our proposed method on three different coronary angiography datasets and two retinal imaging datasets. Quantitative evaluation results of the methods are presented in Table 1. The analysis of the results is as follows.

Comparison of ours to baselines. Compared to the baselines, our model segments vessel structures, including tiny branches, more accurately. Also, as shown in Table 1, our model consistently achieves SOTA performance by a large margin compared to existing unsupervised and self-supervised methods. Specifically, our network shows significantly improved precision scores, which demonstrates the advantage of our DARL in effectively differentiating foreground vessel structures and eliminating false positive signals from the noisy backgrounds.

Generalization capability. To verify that our trained DARL can be generally used for various vessel images taken from different machines or different anatomical regions of interest (ROIs), we evaluate the generalization capability by applying the models trained only on the XCAD dataset to the other datasets. First, for the external 134 XCA and 30 XCA datasets, which have different resolutions and noise distributions from those of the XCAD dataset, our model achieves higher performance than the others, as shown in Figure 4 and Table 1. Also, on the DRIVE and STARE retinal datasets, which have unseen data distributions due to a modality different from that of XCAD, our DARL shows the most promising cross-organ generalization performance. This may come from the proposed framework reusing the generated angiography images for the segmentation process through the cycle path, which diversifies the input data distribution. Also, the diffusion module learning the stochastic diffusion process enables our model to be used in general for vessel segmentation.
Robustness to noise. As X-ray images are often acquired under low-dose radiation exposure to reduce potential risks, we further evaluate the performance of our model on simulated noisy angiograms. Using the XCAD dataset, we add Gaussian noise to the angiograms with different levels of $\sigma$ = 10, 25, and 50. We show the segmentation results according to the noise levels in Figure 4. Also, we report the quantitative evaluation results in Table 2. It is noteworthy that our DARL is the only method that segments vessel structures with reasonable performance under noise corruption. Since the proposed segmentation method is trained through the diffusion module that perturbs the input images, the model is highly robust in segmenting vessel structures even from noisy data.
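A noise-corruption step of this kind could be simulated as in the sketch below. It assumes that $\sigma$ is expressed on an 8-bit intensity scale (0 to 255) and that corrupted values are clipped back into that range; neither detail is stated in the text, and the function name `add_gaussian_noise` is hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng, lo=0.0, hi=255.0):
    """Add zero-mean Gaussian noise to a nested-list image and clip to [lo, hi].

    The 0-255 intensity range and the clipping are assumptions for illustration.
    """
    return [
        [min(hi, max(lo, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    angiogram = [[120.0, 130.0], [90.0, 200.0]]   # toy 2 x 2 image
    for sigma in (10, 25, 50):
        print(sigma, add_gaussian_noise(angiogram, sigma, rng))
```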

Latent representation

To study the origin of the performance improvement, in Figure 5, we show the latent features $\epsilon_\theta(x_t, t)$ given $x_0$ for (A) the angiography $x_0 = x^a_0$ and (B) the background $x_0 = x^b_0$ with $t = 100$, respectively. In contrast to the (B) path, the latent representation in the (A) path emphasizes the vessel structures. This implies that although there are no ground-truth labels, our model learns the background image representation so that the vessel structures can be captured as outliers, leading to improved segmentation performance.

Ablation study. Table 3 shows the evaluation results of several ablation studies. Implementation details and visual results are in Appendix D.1.
(a) Our model without the diffusion module and $\mathcal{L}_{diff}$ shows lower performance by about 2% for all metrics compared to our full model, which suggests that the diffusion module guides the generation module to extract vessel representations accurately.
(b) The generation module without the proposed S-SPADE layers is degraded by more than 1% over (a) for all metrics, verifying that our SPADE-based unified generator effectively captures vessel semantic information through the synergy of learning both image segmentation and generation.
(c) Through the implementation of our model without the proposed cyclic loss $\mathcal{L}_{cyc}$, we verify that $\mathcal{L}_{cyc}$ allows our model to segment proper vessel regions.
(d) When training our model with the L1 loss in $\mathcal{L}_{cyc}$ replaced by the cross-entropy (CE) loss, the performance is much worse than ours in all metrics, which implies that using the L1 loss for the cycle path is appropriate for obtaining the vessel masks.

5. CONCLUSION

We present a non-iterative diffusion model called DARL for self-supervised vessel segmentation. Our model, composed of the diffusion and generation modules, learns vessel representations without labels via adversarial learning, under the guidance of latent features estimated from the diffusion module. Also, through the proposed switchable SPADE layer, we generate synthetic angiograms as well as vessel segmentation masks, which leads to learning semantic information about vessels more effectively. Although the diffusion module training is combined with other loss functions, the inference is not iterative but is done in only one step, which makes it faster than, and distinct from, existing diffusion models. Using various medical vessel datasets, we verify that our model substantially outperforms existing un-/self-supervised learning methods. Moreover, thanks to the diffusion module, our model is robust to image diversity and noise, suggesting that it can be an important platform for designing a general vessel segmentation model.

REPRODUCIBILITY

Source code is available at https://github.com/bispl-kaist/DARL.

A DETAILS OF NETWORK ARCHITECTURE

In this section, we provide details of the generator $G$ proposed in our diffusion adversarial representation learning (DARL) model, which is composed of the diffusion module and the generation module. For the diffusion module, we adapt the network architecture of DDPM (Ho et al., 2020), which has a U-Net (Ronneberger et al., 2015) structure, as described in Table 4. The generation module is composed of four consecutive residual blocks (He et al., 2016) with switchable spatially-adaptive denormalization (SPADE) layers, as described in Table 5.

Table 4: Detailed network architecture of the diffusion module. For each block (Blk), $C_{i,j}$ is the convolution layer with an $i \times i$ kernel and stride length of $j$, $RS_i$ pairs are entry points for the residual shortcut path within a block unit, RB is the residual block module, and SA is the self-attention module. GN is the group normalization, and Ch indicates the size of the output channel dimension.

Blk | Diffusion module: Downstream | Diffusion module: Upstream | Ch
1 | $C_{3,1}$ $RS_1$ RB $RS_2$ RB $RS_3$ | $RS_3$ RB $RS_2$ RB $RS_1$ RB CB | 1
2 | $C_{3,2}$ $RS_1$ RB $RS_2$ RB $RS_3$ | $RS_3$ RB $RS_2$ RB $RS_1$ RB UP | 64
3 | $C_{3,2}$ $RS_1$ RB $RS_2$ RB $RS_3$ | $RS_3$ RB $RS_2$ RB $RS_1$ RB UP | 128
4 | $C_{3,2}$ $RS_1$ RB $RS_2$ RB $RS_3$ | $RS_3$ RB $RS_2$ RB $RS_1$ RB UP | 128
5 | $C_{3,2}$ $RS_1$ RB SA $RS_2$ RB SA $RS_3$ | $RS_3$ RB SA $RS_2$ RB SA $RS_1$ RB SA UP | 256
6 | $C_{3,2}$ $RS_1$ RB $RS_2$ RB $RS_3$ | $RS_3$ RB $RS_2$ RB $RS_1$ RB UP | 256
Mid | RB SA RB | |

Note: RB = [$RS_n$ - GN - Swish - $C_{3,1}$ - GN - Swish - $C_{3,1}$ - $RS_n$], SA = [GN - $C_1$ - $C_1$], UP = [Upsample - $C_{3,1}$], CB = [GN - Swish - $C_3$]

Table 5 (generation module):

Block | Layers | Ch
In | $C_7$ IN ReLU | 64
DownBlock1 | $C_{3,2}$ IN ReLU | 128
DownBlock2 | $C_{3,2}$ IN ReLU | 256
MidResBlock1 | $RS_1$ $C_{3,1}$ ReLU S-SPADE $C_{3,1}$ ReLU S-SPADE $RS_1$ | 256
MidResBlock2 | $RS_2$ $C_{3,1}$ ReLU S-SPADE $C_{3,1}$ ReLU S-SPADE $RS_2$ | 256
UpResBlock1 | UP $RS_3$ $C_{3,1}$ ReLU S-SPADE $C_{3,1}$ ReLU S-SPADE $RS_3$ | 128
UpResBlock2 | UP $RS_4$ $C_{3,1}$ ReLU S-SPADE $C_{3,1}$ ReLU S-SPADE $RS_4$ | 64
Out | $C_7$ | 1

B DETAILS OF DATASET

For training the network, as described in Table 6, we utilize the XCAD dataset (Ma et al., 2021), which provides a total of 1,621 unlabeled X-ray coronary angiography frames. We use each first frame, taken before the contrast agent injection, as the real background image. We also generate 1,621 synthetic fractal masks by using the fractal synthetic module proposed by Ma et al. (2021). The fractal masks are synthesized by drawing rectangles with randomly sampled thickness ranging from 15 to 25 pixels on a black background of size 512 × 512. Then, local distortions are applied to each rectangle, including affine transformation with a random scale and rotation by a random angle, which results in masks with various shapes and thicknesses. This reduces the effort needed to match the real vessel thickness distribution; thus, one can simply synthesize such varied fractal masks through the fractal synthetic module.
An additional 126 angiography images, along with the ground-truth vessel masks annotated by experienced radiologists, are divided into validation and test sets by 10% and 90%, respectively. We subsample all data to 256×256.

For training the baseline methods, we utilize the same number of angiography images from the XCAD dataset. Specifically, our method and SSVS utilize angiography images, background images, and synthetic fractal masks for training both the segmentation and generation paths, with each type of data randomly sampled independently. DA utilizes angiography images and fractal masks for adversarial training. For inference with DS, we utilize pre-trained network parameters. Redrawing, STEGO, and SGC are essentially clustering-based methods, which need only angiography images. For all these methods, we use the same 256 × 256 images as ours, without any further image processing such as normalization, but augment the data through random flipping and 90-degree rotation.

For the external test datasets, we utilize two X-ray coronary angiography (XCA) datasets acquired from different machines. The 134 XCA dataset is composed of 134 angiography images with vessel masks labeled by an expert cardiologist (Cervantes-Sanchez et al., 2019). The 30 XCA dataset is composed of 30 sequences of angiography images (Hao et al., 2020). We utilize one angiography image from each sequence, along with its corresponding ground-truth vessel mask labeled by experts. All the test images are resized to 512×512.

Furthermore, for evaluating cross-organ generalization capability, we utilize retinal imaging datasets. We use the DRIVE (Staal et al., 2004) and STARE (Hoover & Goldbaum, 2003) datasets, each of which is composed of 20 retinal images and the corresponding expert-labeled vessel masks. Since retinal images are acquired at high resolution, we resize each image to 768×768 and split it into 9 patches of 256×256.
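The retinal patching step, splitting a resized 768×768 image into a 3 × 3 grid of 256×256 tiles, can be sketched as below. The sketch assumes non-overlapping tiles in row-major order, which the text does not specify; `split_into_patches` is a hypothetical helper operating on nested lists.

```python
def split_into_patches(image, patch):
    """Split an H x W nested-list image into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch]
                            for row in image[top:top + patch]])
    return patches


if __name__ == "__main__":
    retinal = [[0] * 768 for _ in range(768)]   # placeholder for a resized image
    tiles = split_into_patches(retinal, 256)
    print(len(tiles), len(tiles[0]), len(tiles[0][0]))
```

Per-patch predictions would then be placed back at the same grid positions to assemble a full-size mask.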
As described in the main paper, a single generation module with the switchable SPADE layers in our model provides both the synthetic angiograms and the vessel segmentation masks by one-step inference, in contrast to the iterative inference steps of other diffusion models. To evaluate the angiogram synthesis performance, we compare our model with DA and SSVS. These baselines generate the segmentation masks and the synthetic angiograms, but unlike ours, they use two different networks within the CycleGAN (Zhu et al., 2017) framework. Furthermore, in this study, we adopt OASIS (Sushko et al., 2020), one of the SOTA semantic image synthesis models, as an additional baseline. We train the OASIS model under the same unpaired dataset setting as ours. For all the baseline methods, a total of 1,621 synthetic angiography images are generated using the backgrounds and fractal masks as inputs.

Figure 6 compares the visual results of synthetic angiograms. Compared to the others, our generation module yields the most realistic images, which naturally reflect the fractal masks on the background and contain even tiny branches. This verifies that our model maintains consistency on the vessel-like fractal signals, capturing vessel semantic information effectively and leading to the improvement of segmentation performance. We also perform a quantitative evaluation on the image generation of angiograms and vessel masks using the Fréchet inception distance (FID) (Heusel et al., 2017). Table 7 shows that our model achieves much lower FID scores than the comparative methods even though the FID is originally designed for evaluating natural image synthesis, which suggests the superiority of ours in generating angiography images. Also, the proposed DARL model provides the most realistic segmentation masks among the compared methods.
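For reference, the FID (Heusel et al., 2017) is the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below are introduced here for illustration only: the means and covariances of the real features are written with subscript r and those of the generated features with subscript g.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, which is why a lower FID indicates more realistic generation.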

C.2 STUDY ON HYPERPARAMETER SETTING

In the main paper, we report the vessel segmentation results from the model trained with α = 0.2 and β = 5, which yields the best performance in our hyperparameter study. To study the effects of the hyperparameters on the segmentation performance with the proposed loss function, we train our model with β fixed at 5.0 while adjusting α. Similarly, α is fixed at 0.2 while β is adjusted. Figure 7 shows the quantitative evaluation results for the IoU, Dice, and Precision metrics with respect to α and β. We can see that our model learns the semantic vessel segmentation when the parameter α, which controls the adversarial loss L_adv, is equal to or greater than 0.2, though the performance gradually decreases as α increases; the highest performance for all metrics is achieved when α = 0.2. On the other hand, while our model hardly captures the semantic information of vessels when there is no cycle path in network training (i.e., β = 0), the model provides plausible vessel segmentation masks as long as the cycle path exists. With respect to β, which weights the cyclic loss L_cyc, the optimal performance is obtained when β = 5.0.
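The one-factor-at-a-time protocol above can be sketched as follows. This is only an illustration under stated assumptions: the total objective is assumed to be the weighted sum L_diff + α·L_adv + β·L_cyc, the function names are hypothetical, and the candidate values in `ALPHAS` and `BETAS` are placeholders rather than the grid actually used for Figure 7.

```python
def total_loss(l_diff, l_adv, l_cyc, alpha=0.2, beta=5.0):
    """ASSUMPTION: the objective is a weighted sum of the three loss terms,
    with alpha weighting the adversarial loss and beta the cycle loss."""
    return l_diff + alpha * l_adv + beta * l_cyc


def one_at_a_time_grid(alpha_values, beta_values, alpha_default=0.2, beta_default=5.0):
    """Vary alpha with beta fixed, then vary beta with alpha fixed,
    without repeating the shared default configuration."""
    configs = [(a, beta_default) for a in alpha_values]
    for b in beta_values:
        if (alpha_default, b) not in configs:
            configs.append((alpha_default, b))
    return configs


ALPHAS = [0.0, 0.2, 0.5, 1.0]   # hypothetical candidate values
BETAS = [0.0, 1.0, 5.0, 10.0]   # hypothetical candidate values
configs = one_at_a_time_grid(ALPHAS, BETAS)

for alpha, beta in configs:
    # A real study would train and evaluate one model per configuration here.
    print(f"alpha={alpha}, beta={beta}")
```

Sweeping one weight at a time keeps the number of training runs linear in the number of candidate values, at the cost of not exploring interactions between α and β.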

C.3 STUDY ON ANGIOGRAM PERTURBATION IN MODEL TRAINING

When training our model, the background images are perturbed by the forward diffusion process with a uniformly sampled time step t_b ∈ [0, T], whereas the angiograms are perturbed with a time step t_a ∈ [0, T_a], where T_a < T. To investigate the effect of the time step size T_a for the angiogram perturbation on model performance, we train our model with different time step sizes by setting T_a to 100, 200, 500, and 1000. Figure 8 shows the quantitative evaluation results. When T_a is less than 200, the vessel segmentation performance on the clean angiograms improves, but the performance degrades on the simulated low-dose angiograms corrupted by Gaussian noise with σ levels of 25 and 50. Also, when T_a is set over 500, the model shows drastically lower performance due to the lack of vascular information in the noisy angiograms. These results imply that the diffusion module can optimally provide latent features including vascular structures as long as the model is trained

Implementation detail

In Section 4.1 of the main paper, we implemented several ablated versions of our model. For the methods without the diffusion module (i.e., (a) and (b) in Table 3), the real angiography images are given to the (A) path, and the real background images and synthetic fractal masks are given to the (B) path. Here, as there is no diffusion module, the input images are not perturbed. In particular, for the method without S-SPADE layers in the generation module (i.e., (b) in Table 3), there are two independent generation modules for the (A) and (B) paths, with instance normalization and SPADE as the respective normalization layers, so that the input image for each path is given to the corresponding module. On the other hand, for the ablation studies on the loss functions, the data flows are the same as in our full model.
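Returning to the perturbation scheme of Section C.3, the difference between the background and angiogram time steps can be sketched with the standard DDPM forward process (Ho et al., 2020). This is a minimal illustration under assumptions: the linear β schedule and its endpoints, T = 2000, T_a = 200, and the toy pixel lists are all placeholders chosen for the example, not values confirmed by the text.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule
    (ASSUMPTION). Index 0 corresponds to the clean, unperturbed image."""
    alpha_bars = [1.0]
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def perturb(pixels, t, alpha_bars, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


T, T_A = 2000, 200                    # hypothetical values with T_A < T
rng = random.Random(0)
alpha_bars = alpha_bar_schedule(T)

background = [0.1, -0.3, 0.5, 0.0]    # toy stand-ins for image pixels
angiogram = [0.2, -0.8, 0.4, 0.1]

t_b = rng.randint(0, T)               # background: any time step in [0, T]
t_a = rng.randint(0, T_A)             # angiogram: restricted to [0, T_a]
noisy_background = perturb(background, t_b, alpha_bars, rng)
noisy_angiogram = perturb(angiogram, t_a, alpha_bars, rng)
```

Restricting t_a to a smaller range keeps more of the original signal in the perturbed angiogram, which is consistent with the observation that large T_a removes vascular information.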

Visual results

Figure 11 shows the qualitative comparison results for the ablation studies in the main paper. (a) and (b) show that our model trained without the diffusion module generates vessel masks that include many false positive regions. Comparing ours with (c), we can observe that the cycle consistency on the fake vessels allows our model to segment tiny vessels more accurately. Moreover, the comparison of (d) and ours verifies that the proposed model achieves better performance, even though we treat the segmentation problem as image generation by using the L1 loss for the segmentation masks.

D.2 RESULTS OF OUR DARL MODEL

In this section, we provide additional vessel segmentation results that show the success of our DARL in self-supervised segmentation. Figure 12 shows that our model consistently provides the best performance on the XCAD dataset. Our model also segments vascular structures better than the other baselines on the external angiography datasets (134 XCA and 30 XCA) and the retinal image datasets (DRIVE and STARE). These results suggest that the proposed model can be used as a general vessel segmentation model for various vascular images. In addition, Figure 13 shows the vessel segmentation results on the XCAD data corrupted by Gaussian noise with different levels of σ. The visual results demonstrate that our DARL is the only method that endures harsh corruption, and that it outperforms the baseline methods.



can be considered as one of the clean target images, the closer t_a is to zero, the better the vessel segmentation performance. Therefore, we test our model by setting t_a = 0.

4 EXPERIMENTS

In this section, we thoroughly evaluate the vessel segmentation performance of our method. We first compare the proposed DARL to existing unsupervised and self-supervised baseline models on various angiography datasets, including X-ray coronary angiography and retinal images. We also study the noise robustness of our model. Then, we analyze the success of our model in vessel representation and conduct an ablation study.



Figure 1: Our proposed diffusion adversarial representation model for self-supervised vessel segmentation. In path (A), given a real noisy angiography image x^a_{t_a}, our model estimates vessel segmentation masks ŝ^v. In path (B), given a real noisy background image x^b_{t_b} and a vessel-like fractal mask s^f, our model generates a synthetic angiography image x̂^a.

Figure 2: Training flow of our model. The generation module G with the switchable SPADE layers takes ϵ_θ and the noisy images, and generates the desired outputs corresponding to the paths. x^a_{t_a} and x^b_{t_b} denote the noisy angiography and background images, where t_a and t_b are noise schedules. ŝ^v is the generated vessel segmentation, and x̂^a is the synthetic angiography image. s^f is the vessel-like fractal mask. Cat denotes the concatenation of images along the channel dimension. L_diff, L_adv, and L_cyc are the diffusion loss, the adversarial loss, and the cycle loss, respectively.

Figure 4: Visual comparison results on the vessel segmentation of various angiography images.

Figure 5: Estimated latent features ϵ_θ in the (A) and (B) paths of our model.

Figure 6: Visual comparison results of the generated fake angiograms. Yellow boxes in ×1 rows are magnified by two or three times in the corresponding bottom rows, respectively. Red triangles indicate remarkable points.

Figure 7: Vessel segmentation performance of our model on the XCAD dataset according to the hyperparameters of the proposed loss function. Each column shows the average values of the quantitative evaluation results with respect to (a) α for the adversarial loss L_adv and (b) β for the cyclic reconstruction loss L_cyc.

Figure 8: Vessel segmentation performance of our model according to the settings of T_a for the angiogram perturbation. Each graph shows the quantitative evaluation results with respect to the Gaussian noise level σ.

Figure 11: Visual comparison results of the ablation study. (a-d) correspond to the case studies (a-d) in Table 3 of the main paper. Yellow boxes denote remarkable parts.

Figure 13: Additional visual comparison results on XCAD datasets with different levels (σ) of Gaussian noise.

Quantitative evaluation results on the vessel segmentation of various angiography images.

±0.034  0.059 ±0.032  0.366 ±0.105  0.146 ±0.070  0.375 ±0.066  0.410 ±0.087  0.471 ±0.076
Dice  0.111 ±0.060  0.109 ±0.056  0.526 ±0.131  0.249 ±0.103  0.542 ±0.073  0.575 ±0.091  0.636 ±0.072
Precision  0.062 ±0.034  0.139 ±0.081  0.469 ±0.127  0.152 ±0.077  0.557 ±0.115  0.590 ±0.119  0.701 ±0.115

±0.035  0.056 ±0.018  0.256 ±0.110  0.134 ±0.081  0.190 ±0.155  0.318 ±0.128  0.426 ±0.059
Dice  0.085 ±0.063  0.105 ±0.033  0.394 ±0.159  0.228 ±0.109  0.291 ±0.217  0.468 ±0.156  0.595 ±0.058
Precision  0.047 ±0.036  0.058 ±0.019  0.280 ±0.123  0.136 ±0.088  0.506 ±0.201  0.592 ±0.125  0.781 ±0.118

±0.039  0.048 ±0.022  0.339 ±0.086  0.191 ±0.072  0.298 ±0.109  0.324 ±0.146  0.427 ±0.184
Dice  0.150 ±0.064  0.091 ±0.040  0.499 ±0.113  0.314 ±0.100  0.447 ±0.148  0.468 ±0.193  0.572 ±0.205
Precision  0.090 ±0.041  0.144 ±0.074  0.525 ±0.130  0.200 ±0.081  0.612 ±0.174  0.613 ±0.212  0.729 ±0.152

±0.055  0.057 ±0.033  0.217 ±0.143  0.152 ±0.073  0.245 ±0.090  0.314 ±0.101  0.372 ±0.148
Dice  0.115 ±0.093  0.105 ±0.059  0.333 ±0.201  0.257 ±0.106  0.386 ±0.117  0.469 ±0.119  0.525 ±0.161
Precision  0.069 ±0.061  0.199 ±0.155  0.243 ±0.175  0.169 ±0.100  0.503 ±0.218  0.549 ±0.216  0.617 ±0.271

STARE
IoU  0.055 ±0.045  0.074 ±0.048  0.180 ±0.141  0.125 ±0.076  0.237 ±0.122  0.311 ±0.148  0.368 ±0.191
Dice  0.101 ±0.077  0.134 ±0.080  0.281 ±0.201  0.216 ±0.109  0.367 ±0.167  0.454 ±0.185  0.508 ±0.216
Precision  0.058 ±0.047  0.227 ±0.157  0.205 ±0.172  0.135 ±0.092  0.427 ±0.233  0.490 ±0.230  0.537 ±0.280
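The IoU, Dice, and Precision values reported above can be computed from binary masks as sketched below. This uses the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN). The function name, the flat-list mask format, and the convention of returning 0.0 for an empty denominator are assumptions made for this illustration.

```python
def segmentation_metrics(pred, target):
    """Compute IoU, Dice, and Precision for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)

    def ratio(numerator, denominator):
        # ASSUMPTION: report 0.0 when a metric is undefined (empty denominator).
        return numerator / denominator if denominator else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
    }


# Toy example: TP = 2, FP = 1, FN = 1.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
metrics = segmentation_metrics(pred, target)
print(metrics)
```

Because Dice equals 2·IoU / (1 + IoU), the two metrics always rank a single prediction consistently, whereas Precision additionally reveals whether errors are dominated by false positives.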

Results of noise robustness test according to the Gaussian noise with σ.

±0.033  0.052 ±0.031  0.331 ±0.104  0.144 ±0.073  0.353 ±0.065  0.258 ±0.079  0.451 ±0.080
Dice  0.122 ±0.059  0.096 ±0.053  0.487 ±0.133  0.245 ±0.107  0.519 ±0.073  0.404 ±0.099  0.617 ±0.076
Precision  0.069 ±0.035  0.126 ±0.077  0.480 ±0.135  0.157 ±0.091  0.481 ±0.104  0.477 ±0.117  0.710 ±0.115

25
IoU  0.069 ±0.035  0.036 ±0.021  0.232 ±0.094  0.118 ±0.064  0.247 ±0.072  0.059 ±0.033  0.389 ±0.088
Dice  0.128 ±0.061  0.069 ±0.039  0.366 ±0.132  0.206 ±0.095  0.391 ±0.092  0.109 ±0.058  0.554 ±0.092
Precision  0.072 ±0.036  0.095 ±0.058  0.446 ±0.159  0.144 ±0.115  0.371 ±0.106  0.149 ±0.082  0.727 ±0.119

50
IoU  0.070 ±0.025  0.020 ±0.012  0.077 ±0.065  0.060 ±0.050  0.102 ±0.056  0.021 ±0.013  0.269 ±0.081
Dice  0.130 ±0.045  0.040 ±0.022  0.136 ±0.109  0.108 ±0.088  0.180 ±0.091  0.041 ±0.025  0.417 ±0.101
Precision  0.072 ±0.026  0.061 ±0.038  0.221 ±0.168  0.076 ±0.067  0.169 ±0.094  0.060 ±0.038  0.716 ±0.147

Results of ablation study on the proposed model and loss function.

Detailed network architecture of the generation module. UP is the nearest neighbor upsampling function, RS i pairs are entry points for residual shortcut path, C i,j is the convolution layer with i × i kernel and stride length of j, and IN is the instance normalization layer. S-SPADE is the proposed switchable SPADE layer that turns on SPADE if the semantic layout is provided, otherwise turns off SPADE and applies IN. Ch indicates the size of output channel dimension.

Detailed dataset for training each (A) segmentation and (B) generation path.

Quantitative comparison results using FID score on the image generation of vessel masks and angiograms. Lower FID means that the image generation is more realistic.

Vessel segmentation performance of our model without the discriminator D s /D a .

The Dice scores for the segmentation performance of unsupervised methods applying HE and CLAHE to the input images.

Our model has a unified generator module that performs both the segmentation and the generation tasks simultaneously, which decreases the training complexity compared to the CycleGAN structure of other adversarial learning frameworks such as DA or SSVS. We estimate the floating point operations per second (FLOPS) of each method in Table 12, which demonstrates the cost-effective characteristic of our model.

Training complexity (FLOPS).

ACKNOWLEDGMENTS

This work was supported in part by the National Research Foundation of Korea under Grant NRF-2020R1A2B5B03001980, in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT, Ministry of Science and ICT) (No. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation), in part by the MSIT (Ministry of Science and ICT), under the ITRC (Information Technology Research Center) support program (IITP-2022-2020-0-01461) supervised by the IITP, and in part by the KAIST Key Research Institute (Interdisciplinary Research Group) Project.

C.5 STUDY ON DIFFUSION MODEL IN LATENT FEATURE ESTIMATION

Recall that the output of the diffusion module is not a simple latent feature of networks but a score function that has spatial information of the data. To show the effect of the diffusion model in our framework, we additionally study our framework by replacing the diffusion module with an autoencoder model. For a fair comparison, we configure the autoencoder by adopting the same DDPM network architecture (Ho et al., 2020) as ours but removing the time embedding vectors, and then train it without the diffusion loss. Figure 9 shows the latent features and vessel masks of the angiography images. We can observe that while the autoencoder model estimates latent features that include vessels and other similar confusing structures, our proposed framework with the diffusion module represents vessels better in the latent features and provides more accurate segmentation masks. Table 9 also shows that the autoencoder model achieves inferior performance compared to ours. These results indicate that the latent features from the diffusion module allow the generation module to effectively learn vessel representation.

The contribution of the proposed DARL model is further analyzed under supervised, semi-supervised, and self-supervised learning scenarios, and the results on various datasets are described in Table 10.

Supervised

Firstly, we apply the proposed framework as a supervised model. Here, since there are no labeled pairs in the training dataset, we conduct two supervised segmentation experiments utilizing the 12 pairs of labeled validation data: 1) training our model through only the (A) path, and 2) training our model by giving the labeled data to the (A) path while keeping the (B) path. To evaluate the performance of the supervised approaches, we report the results of the best model among the models saved at 10-epoch intervals. In Table 10, we can observe that in the few-label scenario, our DARL framework helps the segmentation path achieve superior performance, outperforming the supervised model trained only with the (A) path. Moreover, it is noteworthy that the performance of our self-supervised model surpasses the supervised approach using only the (A) path by large margins, even though ours is trained without any supervised data.

Semi-supervised

As the outputs in the (A) and (B) paths of our DARL model can be used as pseudo-labels or inputs for vessel segmentation, we train the segmentation model in a semi-supervised manner. Specifically, for a fair comparison, we prepare a segmentation network whose architecture is identical to that of our generation module G. Then, we train the network by utilizing the (A) path outputs ŝ^v as paired pseudo-labels for the inputs x^a. Similarly, we train the network by using the (B) path outputs x̂^a as paired pseudo-inputs for the labels s^f. As reported in Table 10, the model trained using the data from the (A) path achieves slightly higher performance than ours. This suggests that the generated masks from our model can be used as pseudo-labels for unlabeled data. It is also remarkable that although our method estimates the vessel maps and synthetic images simultaneously, it achieves performance comparable to the semi-supervised method using the pseudo-labels. On the other hand, the model trained using the data from the (B) path shows mostly lower performance than ours, implying that the cycle path in our model is more effective for extracting vessels.

Self-supervised

We further test the application of our model to an environment that has no background images. Our model is trained by replacing the background images x^b in the (B) path with the real angiography images x^a. Table 10 shows that the segmentation performance of our model without non-contrast background images is comparable or even superior to that of our original model.
As the generation module of our DARL takes latent features of input images, our model can synthesize images involving the information of input angiography images, based on semantic masks, and learn vessel features. This can be a unique characteristic compared to other semantic image synthesis models that typically take the images directly.

For the implementation of the comparison methods, we did not perform any image processing except for resizing the images. However, as the performance of unsupervised methods could be affected by processing such as normalization, we additionally tested the unsupervised methods by applying several normalization methods to the input images. As shown in Figure 10, we processed the images using histogram equalization (HE) and contrast limited adaptive histogram equalization (CLAHE). Table 11 shows that the performance of the unsupervised methods degrades when using the normalized data. This is because, even though CLAHE and HE enhance the image contrast, the confusing background structures in angiography images make it hard to highlight only the vessel regions. On the other hand, our method outperforms the other comparative models, suggesting the superiority of our method.
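As an illustration of the HE preprocessing mentioned above, the following is a minimal sketch of textbook global histogram equalization on 8-bit intensities. It is not necessarily the implementation used in the experiments, the function name is hypothetical, and CLAHE (which additionally works on local tiles with a clipped histogram) is not covered.

```python
def equalize_histogram(pixels, levels=256):
    """Global histogram equalization for a flat list of integer intensities
    in [0, levels - 1], using the cumulative distribution as the mapping."""
    if not pixels:
        return []
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    # Cumulative distribution function over intensity levels.
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = min(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return list(pixels)

    # Map each level so that the occupied range spans [0, levels - 1].
    lookup = [
        max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
        for c in cdf
    ]
    return [lookup[value] for value in pixels]


# Toy example: four distinct intensities are spread over the full range.
equalized = equalize_histogram([50, 100, 150, 200])
print(equalized)
```

Since the mapping depends only on the global intensity histogram, it raises the contrast of background structures together with the vessels, which is consistent with the degradation observed for the unsupervised baselines.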

