3D UX-NET: A LARGE KERNEL VOLUMETRIC CONVNET MODERNIZING HIERARCHICAL TRANSFORMER FOR MEDICAL IMAGE SEGMENTATION

Abstract

Recent 3D medical ViTs (e.g., SwinUNETR) achieve state-of-the-art performance on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation to 3D medical datasets. The effectiveness of these hybrid approaches is largely credited to the large receptive field of non-local self-attention and the large number of model parameters. We hypothesize that volumetric ConvNets can simulate the large-receptive-field behavior of these learning approaches with fewer model parameters using depth-wise convolution. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel (LK) sizes (e.g., starting from 7 × 7 × 7) to enable larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performance with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g., SwinUNETR) on three challenging public datasets for volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR, with improvements from 0.929 to 0.938 Dice (FLARE2021) and from 0.867 to 0.874 Dice (FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net on AMOS2022 and demonstrate another improvement of 2.27% Dice (from 0.880 to 0.900).

1. INTRODUCTION

Significant progress has been made recently with the introduction of vision transformers (ViTs) Dosovitskiy et al. (2020) into 3D medical downstream tasks, especially volumetric segmentation benchmarks Wang et al. (2021); Hatamizadeh et al. (2022b); Zhou et al. (2021); Xie et al. (2021); Chen et al. (2021). The characteristics of ViTs are the lack of image-specific inductive bias and the scaling behavior, which are enhanced by large model capacities and dataset sizes. Both characteristics contribute to the significant improvement over ConvNets on medical image segmentation Tang et al. (2022); Bao et al. (2021); He et al. (2022); Atito et al. (2021). However, it is challenging to adapt 3D ViT models as generic network backbones due to the high complexity of computing global self-attention with respect to the input size, especially for high-resolution images with dense features across scales. Therefore, hierarchical transformers were proposed to bridge these gaps with their intrinsic hybrid structure Zhang et al. (2022); Liu et al. (2021). Introducing the "sliding window" strategy into ViTs, the Swin Transformer behaves similarly to ConvNets Liu et al. (2021). SwinUNETR adapts Swin Transformer blocks as the generic vision encoder backbone and achieves current state-of-the-art performance on several 3D segmentation benchmarks Hatamizadeh et al. (2022a); Tang et al. (2022). Such performance gain is largely owing to the large receptive field of 3D shifted-window multi-head self-attention (MSA). However, shifted-window MSA is computationally unscalable via traditional 3D volumetric ConvNet architectures. As the advancement of ViTs starts to bring back the concepts of convolution, the key components behind such large performance differences are attributed to the scaling behavior and global self-attention with large receptive fields.
As such, we further ask: Can we leverage convolution modules to enable the capabilities of hierarchical transformers? The recent advance in LK-based depthwise convolution design (e.g., Liu et al. (2022)) provides a computationally scalable mechanism for a large receptive field in 2D ConvNets. Inspired by such design, this study revisits the 3D volumetric ConvNet design to investigate the feasibility of (1) achieving SOTA performance via a pure ConvNet architecture, (2) yielding much lower network complexity compared with 3D ViTs, and (3) providing a new direction for designing 3D ConvNets on volumetric high-resolution tasks. Unlike SwinUNETR, we propose a lightweight volumetric ConvNet, 3D UX-Net, to adapt the intrinsic properties of the Swin Transformer with ConvNet modules and enhance volumetric segmentation performance with smaller model capacity. Specifically, we introduce volumetric depth-wise convolutions with LK sizes to simulate the large-receptive-field operation that generates self-attention in the Swin Transformer. Furthermore, instead of linearly scaling the self-attention features across channels, we introduce pointwise depth convolution scaling to distribute each channel-wise feature independently into a wider hidden dimension (e.g., 4× the input channels), thus minimizing the redundancy of learned context across channels and preserving model performance without increasing model capacity. We evaluate 3D UX-Net on supervised volumetric segmentation tasks with three public volumetric datasets: 1) MICCAI Challenge 2021 FeTA (infant brain imaging), 2) MICCAI Challenge 2021 FLARE (abdominal imaging), and 3) MICCAI Challenge 2022 AMOS (abdominal imaging). Surprisingly, 3D UX-Net, a network constructed purely from ConvNet modules, demonstrates a consistent improvement across all datasets compared with the current transformer SOTA.
We summarize our contributions as below:
• We propose 3D UX-Net to adapt transformer behavior purely with ConvNet modules in a volumetric setting. To the best of our knowledge, this is the first large kernel block design leveraging 3D depthwise convolutions to compete favorably with transformer SOTAs on volumetric segmentation tasks.
• We leverage depth-wise convolution with LK size as the generic feature extraction backbone, and introduce pointwise depth convolution to scale the extracted representations effectively with fewer parameters.
• We use three challenging public datasets to evaluate 3D UX-Net in 1) direct training and 2) finetuning scenarios with volumetric multi-organ/tissue segmentation. 3D UX-Net achieves consistent improvement in both scenarios over all ConvNet and transformer SOTAs with fewer model parameters.

3. 3D UX-NET: INTUITION

Inspired by Liu et al. (2022), we introduce 3D UX-Net, a simple volumetric ConvNet that adapts the capability of hierarchical transformers and preserves the advantages of using ConvNet modules such as inductive biases. The basic idea of designing the encoder block in 3D UX-Net can be divided into 1) block-wise and 2) layer-wise perspectives.

Figure 2: Overview of the proposed 3D UX-Net with our designed convolutional block as the encoder backbone. LK convolution is used to project features into patch-wise embeddings. A downsampling block is used in each stage to mix and enrich context across all channels, while our designed blocks extract meaningful features in a depth-wise setting.

First, we discuss the block-wise perspective in three views:
• Patch-wise Features Projection: Comparing the similarities between ConvNets and ViTs, both networks share a common block that aggressively downscales feature representations into particular patch sizes. Here, instead of flattening image patches as a sequential input with a linear layer Dosovitskiy et al. (2020), we adopt a LK projection layer to extract patch-wise features as the encoder's inputs.
• Volumetric Depth-wise Convolution with LKs: One of the intrinsic properties of the Swin Transformer is the sliding window strategy for computing non-local MSA. Overall, there are two hierarchical ways to compute MSA: 1) window-based MSA (W-MSA) and 2) shifted-window MSA (SW-MSA). Both generate a global receptive field across layers and further refine the feature correspondence between non-overlapping windows. Inspired by the idea of depth-wise convolution, we have found similarities between the weighted-sum approach in self-attention and convolution on a per-channel basis. We argue that using depth-wise convolution with a LK size can provide a large receptive field for extracting features similar to the MSA blocks.
Therefore, we propose compressing the window-shifting characteristics of the Swin Transformer with a volumetric depth-wise convolution using a LK size (e.g., starting from 7 × 7 × 7). Each kernel channel is convolved with the corresponding input channel, so that the output feature has the same channel dimension as the input.
• Inverted Bottleneck with Depthwise Convolutional Scaling: Another intrinsic structure of the Swin Transformer is that the hidden dimension of the MLP block is designed to be four times wider than the input dimension, as shown in Figure 1. Such a design is interestingly correlated with the expansion ratio in the ResNet block He et al. (2016). Therefore, we leverage a similar design to the ResNet block and move the depth-wise convolution up to compute features. Furthermore, we introduce depthwise convolutional scaling (DCS) with a 1 × 1 × 1 kernel to linearly scale each channel feature independently. We enrich the feature representations by expanding and compressing each channel independently, thus minimizing the redundancy of cross-channel context. We enhance the cross-channel feature correspondences with the downsampling block in each stage. Using DCS, we further reduce the model complexity by 5% and demonstrate comparable results to the block architecture using MLP. The macro-design in convolution blocks demonstrates the possibility of adapting the large receptive field and leveraging a similar feature-extraction operation to the Swin Transformer. We further investigate the variation between ConvNets and the Swin Transformer in layer-wise settings and refine the model architecture to better simulate ViTs at the macro level.
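To make the per-channel behavior concrete, the following pure-Python sketch (our own simplified 1D analogue, not the paper's 3D implementation) convolves each channel only with its own kernel, so the output keeps the input's channel count and no cross-channel mixing occurs:

```python
def depthwise_conv1d(x, kernels, padding):
    """1D analogue of depthwise convolution: channel c of the input is
    convolved only with kernel c, so channels never mix and the output
    keeps the input's channel count. x: [C][L], kernels: [C][K]."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        padded = [0.0] * padding + channel + [0.0] * padding
        row = [sum(kernel[j] * padded[i + j] for j in range(k))
               for i in range(len(padded) - k + 1)]
        out.append(row)
    return out

x = [[1.0] * 8, [2.0] * 8]          # 2 channels, length 8
kernels = [[1.0] * 7, [0.0] * 7]    # one 7-tap kernel per channel
y = depthwise_conv1d(x, kernels, padding=3)
```

Note that channel 1's all-zero kernel yields an all-zero output row regardless of channel 0's content, illustrating that each kernel sees only its own channel; cross-channel mixing is deferred to the downsampling block.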
Here, we further define and adapt layer-wise differences from three perspectives:
• Applying Residual Connections: From Figure 1, the gold-standard 3D U-Net block demonstrates the naive approach of using small kernels to extract local representations with increased channels Çiçek et al. (2016), while the SegResNet block applies residuals similar to the transformer block Myronenko (2018). Here, we also apply residual connections between the input and the extracted features after the last scaling layer. However, we do not apply any normalization and activation layers before or after the summation of the residual, to be equivalent to the Swin Transformer structure.
• Using GELU as the Activation Layer: Many previous works have used rectified linear unit (ReLU) activation layers Nair & Hinton (2010) to provide non-linearity in both ConvNets and ViTs. However, previously proposed transformer models demonstrate the Gaussian error linear unit (GELU) to be a smoother variant, which tackles the limitation of the abrupt zeroing of negative inputs in ReLU Hendrycks & Gimpel (2016). Therefore, we further substitute ReLU with the GELU activation function.
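As a concrete comparison of the two activations, here is a minimal pure-Python sketch using the exact erf-based GELU formulation of Hendrycks & Gimpel (2016):

```python
import math

def relu(x):
    """ReLU: hard zero for all negative inputs."""
    return max(0.0, x)

def gelu(x):
    """GELU: x * Phi(x), where Phi is the standard normal CDF.
    Negative inputs are damped smoothly rather than zeroed abruptly."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Near zero, GELU keeps a small negative response where ReLU is exactly 0.
print(relu(-1.0), round(gelu(-1.0), 4))  # prints: 0.0 -0.1587
```

The non-zero gradient on the negative side is what makes GELU the smoother choice in both ConvNeXt-style ConvNets and transformers.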

4. 3D UX-NET: COMPLETE NETWORK DESCRIPTION

3D UX-Net comprises multiple re-designed volumetric convolution blocks that directly utilize 3D patches. Skip connections are further leveraged to connect the multi-resolution features to a convolution-based decoder network. Figure 2 illustrates the complete architecture of 3D UX-Net. We further describe the details of the encoder and decoder in this section.

4.1. DEPTH-WISE CONVOLUTION ENCODER

Given a set of 3D image volumes {V_i} = {X_i, Y_i}_{i=1,...,L}, random sub-volumes P_i ∈ R^{H×W×D×C} are extracted as the inputs to the encoder network. Instead of flattening the patches and projecting them with a linear layer Hatamizadeh et al. (2022b), we leverage a LK convolutional layer to compute a partitioned feature map of size H/2 × W/2 × D/2 that is projected into a C = 48-dimensional space. To adapt the characteristics of computing local self-attention, we use depthwise convolution (DWC) with a kernel size starting from 7 × 7 × 7 and padding of 3, to act as a "shifted window" and evenly divide the feature map. As global self-attention is generally not computationally affordable with the large number of patches extracted in the Swin Transformer Liu et al. (2021), we hypothesize that performing depthwise convolution with a LK size can effectively extract features with a global receptive field. Therefore, we define the output of encoder blocks in layers l and l + 1 as follows:

ẑ^l = DWC(LN(z^{l−1})) + z^{l−1}
z^l = DCS(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = DWC(LN(z^l)) + z^l
z^{l+1} = DCS(LN(ẑ^{l+1})) + ẑ^{l+1}     (1)

where ẑ^l and ẑ^{l+1} are the outputs from the DWC layers at different depth levels; LN and DCS denote layer normalization and depthwise convolutional scaling, respectively (see Figure 1). Compared to the Swin Transformer, we substitute the regular and shifted window-partitioning multi-head self-attention modules, W-MSA and SW-MSA respectively, with two DWC layers.
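The size bookkeeping in this encoder can be checked with simple arithmetic. In the sketch below, the stride-2 stem settings are our assumption of one plausible configuration that halves each axis; the depthwise settings are those stated above:

```python
def conv_out_size(size, kernel, stride, padding):
    """Output size of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

# A stride-2 LK projection (e.g., kernel 7, padding 3) halves each axis:
assert conv_out_size(96, kernel=7, stride=2, padding=3) == 48

# The depthwise 7x7x7 convolution with padding 3 and stride 1 preserves
# the feature-map size, which Eq. (1)'s residual additions require:
assert conv_out_size(48, kernel=7, stride=1, padding=3) == 48
# The same holds for larger odd kernels with padding (k - 1) / 2:
assert conv_out_size(48, kernel=13, stride=1, padding=6) == 48
```

Keeping the spatial size fixed inside each block is what lets the residual terms z and ẑ in Eq. (1) be added elementwise.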

4.2. DECODER

The multi-scale outputs from each stage of the encoder are connected to a ConvNet-based decoder via skip connections, forming a "U-shaped" network for the downstream segmentation task. Specifically, we extract the output feature map of each stage i (i ∈ {0, 1, 2, 3, 4}) in the encoder and further leverage a residual block comprising two post-normalized 3 × 3 × 3 convolutional layers with instance normalization to stabilize the extracted features. The processed features from each stage are then upsampled with a transposed convolutional layer and concatenated with the features from the preceding stage. For downstream volumetric segmentation, we also concatenate the residual features from the input patches with the upsampled features and feed them into a residual block with a 1 × 1 × 1 convolutional layer and a softmax activation to predict the segmentation probabilities.
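The decoder's upsampling arithmetic can be sketched similarly. Kernel-2/stride-2 transposed convolutions are our assumption of a typical choice that exactly doubles each axis; the 96³ patch size matches the training crops described in the appendix:

```python
def transpose_conv_out_size(size, kernel, stride, padding=0):
    """Output size of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

# Assumed per-stage sizes for a 96^3 input patch, halved at each stage:
stage_sizes = [96 // 2 ** i for i in range(5)]        # [96, 48, 24, 12, 6]
# Each transposed convolution doubles a stage's size back to the
# preceding stage's, so the two feature maps can be concatenated:
upsampled = [transpose_conv_out_size(s, kernel=2, stride=2)
             for s in stage_sizes[1:]]
print(upsampled)  # [96, 48, 24, 12]
```

The exact doubling is what guarantees shape compatibility for the skip-connection concatenations at every decoder level.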

5. EXPERIMENTAL SETUP

Datasets: We conduct experiments on three public multi-modality datasets for volumetric segmentation: 1) the MICCAI 2021 FeTA Challenge dataset (FeTA2021) Payette et al. (2021), 2) the MICCAI 2021 FLARE Challenge dataset (FLARE2021) Ma et al. (2021), and 3) the MICCAI 2022 AMOS Challenge dataset (AMOS2022) Ji et al. (2022). For the FeTA2021 dataset, we employ 80 T2-weighted infant brain MRIs from the University Children's Hospital, acquired with 1.5T and 3T clinical whole-body scanners, for brain tissue segmentation with seven specific tissues well-annotated. For FLARE2021 and AMOS2022, we employ 511 multi-contrast abdominal CT volumes from FLARE2021 with four anatomies manually annotated and 200 multi-contrast abdominal CT volumes from AMOS2022 with sixteen anatomies manually annotated for abdominal multi-organ segmentation. More details of the three public datasets can be found in Appendix A.2.
Implementation Details: We perform evaluations in two scenarios: 1) direct supervised training and 2) transfer learning with pretrained weights. The FeTA2021 and FLARE2021 datasets are leveraged to evaluate the direct training scenario, while the AMOS dataset is used in the transfer learning scenario. We perform five-fold cross-validation on both the FeTA2021 and FLARE2021 datasets. More detailed information on the data splits is provided in Appendix A.2. For the transfer learning scenario, we leverage the pretrained weights from the best-fold model trained with FLARE2021 and finetune the model weights on AMOS2022 to evaluate the fine-tuning capability of 3D UX-Net. The complete preprocessing and training details are available in Appendix A.1. Overall, we evaluate 3D UX-Net by comparing with current volumetric transformer and ConvNet SOTA approaches for volumetric segmentation in a fully supervised setting. We use the Dice similarity coefficient as the evaluation metric to compare the overlapping regions between predictions and ground-truth labels.
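For reference, the Dice similarity coefficient over binary masks can be computed as in the minimal sketch below (flattened 0/1 masks; segmentation toolkits such as MONAI provide batched, multi-class versions):

```python
def dice_coefficient(pred, truth, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks.
    eps avoids division by zero when both masks are empty."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    return (2.0 * intersection + eps) / (sum(pred) + sum(truth) + eps)

pred  = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 0, 0, 1]
print(round(dice_coefficient(pred, truth), 3))  # 0.667: 2*2 / (3 + 3)
```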
Furthermore, we performed ablation studies to investigate the effect of different kernel sizes and the variability of substituting linear layers with depthwise convolution for feature extraction. The qualitative results in Figure 4 further provide additional confidence in the quality improvement in segmentation with 3D UX-Net: the morphology of organs and tissues is well preserved compared to the ground-truth labels.

6.2. TRANSFER LEARNING WITH AMOS

Apart from the training-from-scratch scenario, we further investigate the transfer learning capability of 3D UX-Net compared to the transformer SOTAs on the AMOS2022 dataset. We observe that the finetuning performance of 3D UX-Net significantly outperforms the other transformer networks with a mean Dice of 0.900 (2.27% enhancement), and most organ segmentations demonstrate a consistent improvement in quality. Also, from Figure 3, although the convergence curve of each transformer network shows comparability to that of the FLARE2021-trained model, 3D UX-Net further shows its capability of fast convergence and enhanced model robustness with finetuning. Furthermore, the qualitative representations in Figure 4 demonstrate a significant improvement in preserving boundaries between neighboring organs and minimizing over-segmentation into other organ regions.

6.3. ABLATION ANALYSIS

After evaluating the core performance of 3D UX-Net, we study how the different components of our designed architecture contribute to such a significant improvement in performance, as well as how they interact with the other components. Here, both FeTA2021 and FLARE2021 are leveraged to perform ablation studies on the different modules. All ablation studies are performed with the 7 × 7 × 7 kernel size, except the study evaluating the variability of kernel size.
Comparing with Standard Convolution: We investigate the effectiveness of standard convolution and depthwise convolution for initial feature extraction. Standard convolution demonstrates a slight improvement; however, its model parameters are about 3.5 times those of depthwise convolution, while the segmentation performance with depthwise convolution remains comparable on both datasets.
Variation of Kernel Size: From Table 3, we observe that convolution with a 7 × 7 × 7 kernel works optimally for the FeTA2021 dataset, while segmentation on FLARE2021 performs best with a 13 × 13 × 13 kernel. The significant improvement of the 13 × 13 × 13 kernel on FLARE2021 may be due to the larger receptive field enhancing the feature correspondence between multiple neighboring organs within the abdominal region. For the FeTA2021 dataset, only the small infant brains are localized as foreground, and the 7 × 7 × 7 kernel proves to be the optimal receptive field for extracting tissue correspondence.
Adapting DCS: We found a significant performance decrement without MLP-based feature scaling. With linear scaling, the performance is enhanced significantly on FLARE2021, while a slight improvement is demonstrated on FeTA2021.
Interestingly, leveraging depthwise convolution with a 1 × 1 × 1 kernel size for scaling demonstrates a slight performance enhancement on both the FeTA2021 and FLARE2021 datasets. Also, the model parameters further drop from 56.3M to 53.0M without trading off model performance.
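The direction of these parameter savings can be sanity-checked with per-layer counts (our own arithmetic sketch; the 3.5× and 56.3M→53.0M figures above are whole-model numbers, so the per-layer ratios here are only indicative):

```python
def standard_conv3d_params(k, c_in, c_out):
    """Dense 3D convolution: every output channel sees every input channel."""
    return k ** 3 * c_in * c_out + c_out          # weights + biases

def depthwise_conv3d_params(k, c):
    """Depthwise 3D convolution: one k^3 filter per channel."""
    return k ** 3 * c + c

def mlp_scaling_params(c, ratio=4):
    """MLP scaling: dense C -> ratio*C -> C, weights + biases."""
    return (c * ratio * c + ratio * c) + (ratio * c * c + c)

def dcs_params(c, ratio=4):
    """DCS: 1x1x1 per-channel (grouped) expand C -> ratio*C and compress."""
    return (c * ratio + ratio * c) + (ratio * c + c)

c = 48  # first-stage channel width
print(standard_conv3d_params(7, c, c))   # 790,320
print(depthwise_conv3d_params(7, c))     # 16,512
print(depthwise_conv3d_params(13, c))    # 105,504 -- k^3 growth, still small
print(mlp_scaling_params(c), dcs_params(c))  # 18,672 vs 624
```

The per-layer gap also explains why growing the depthwise kernel from 3³ to 13³ barely moves the total parameter count relative to the rest of the network.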

7. DISCUSSION

In this work, we present a block-wise design to simulate the behavior of the Swin Transformer using pure ConvNet modules. We further adapt our design as a generic encoder backbone in a "U-Net"-like architecture via skip connections for volumetric segmentation. We found that the key components behind the improved performance can be divided into two main perspectives: 1) the sliding-window strategy for computing MSA and 2) the inverted bottleneck architecture for widening the computed feature channels. W-MSA enhances learning of the feature correspondence within each window, while SW-MSA strengthens the cross-window connections at the feature level between different non-overlapping windows. Such a strategy integrates ConvNet priors into transformer networks and enlarges the receptive fields for feature extraction. However, we found that depth-wise convolutions can demonstrate similar operations to computing MSA in Swin Transformer blocks. In depth-wise convolutions, we convolve each input channel with a single convolutional filter and stack the convolved outputs together, which is comparable to the patch merging layer for feature outputs in Swin Transformers. Furthermore, adapting the depth-wise convolutions with LK filters demonstrates similarities with both W-MSA and SW-MSA, learning the feature connections within a large receptive field. Our design provides capabilities similar to the Swin Transformer and additionally has the advantage of reducing the number of model parameters using ConvNet modules. Another interesting difference is the inverted bottleneck architecture. Figure 1 shows that both the Swin Transformer and some standard ConvNets have their specific bottleneck architectures (yellow dotted line). The distinctive components of the Swin Transformer's bottleneck are maintaining the channel size at four times the input dimension and the spatial position of the MSA layer.
We follow the inverted bottleneck architecture of the Swin Transformer block and move the depthwise convolution to the top, analogous to the MSA layer. Instead of using linear scaling, we introduce depthwise convolution in a pointwise setting to scale the dense features with wider channels. Interestingly, we found a slight improvement in performance across datasets (FeTA2021: 0.872 to 0.874; FLARE2021: 0.933 to 0.934), with fewer model parameters. As each encoder block only consists of two scaling layers, the limited number of scaling blocks may affect the performance to a small extent. We will further investigate the scalability of the linear scaling layer in 3D as future work.

8. CONCLUSION

We introduce 3D UX-Net, the first volumetric network adapting the capabilities of hierarchical transformers with pure ConvNet modules for medical image segmentation. We re-design the encoder blocks with depthwise convolution and projections to simulate the behavior of hierarchical transformers. Furthermore, we adjust the layer-wise design in the encoder block and enhance the segmentation performance across different training settings. 3D UX-Net outperforms current transformer SOTAs with fewer model parameters on three challenging public datasets in both supervised training and transfer learning scenarios.

Additional validation studies are needed to investigate the effectiveness of both MLP and pointwise DCS and to optimize the 3D UX-Net architecture, which will be the next step of our future work. Another observation in Table 3 is the subtle difference in model parameters between kernel sizes of 3 × 3 × 3 and 7 × 7 × 7. We found that the increase in both model parameters and FLOPs is also attributed to the design of the decoder network. Our decoder block design adds a 3D ResNet block after the transposed convolution to further resample and mix the channel context, instead of directly performing transposed convolution as in nn-UNet. An efficient decoder block design needs further investigation, and using depthwise convolution may be another potential solution to reduce this efficiency burden. To further reduce the burden of low training and inference efficiency, re-parameterization of LK convolutional blocks may be another promising direction. Prior works have scaled up convolutional blocks with LK sizes (31 × 31) and proposed parallel branches with small kernels as residual shortcuts Ding et al. (2022b; 2021; 2022a). The parallel branches can then be mutually converted through equivalent transformation of parameters. For example, a branch of 1 × 1 convolution and a branch of 7 × 7 convolution can be merged into a single branch of 7 × 7 convolution Ding et al. (2021). Furthermore, Hu et al. proposed online convolutional re-parameterization (OREPA), which leverages linear scaling at each branch to diversify the optimization directions, instead of applying non-linear normalization after the convolution layer Hu et al. (2022). Also, stacks of small kernels can be leveraged to generate a receptive field similar to LKs with better training and inference efficiency. The effectiveness of leveraging small-kernel stacks and multiple parallel branch designs will be further investigated as another direction of our future work.
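The equivalent-transformation idea can be illustrated with a 1D pure-Python sketch (our own analogue; the cited works merge 2D kernels and also fold in normalization parameters): summing a large-kernel branch and a zero-padded small-kernel branch gives exactly the same output as a single merged kernel.

```python
def conv1d(x, kernel, padding):
    """Plain 1D convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    padded = [0.0] * padding + x + [0.0] * padding
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(padded) - k + 1)]

def merge_branches(large, small):
    """Zero-pad the small kernel to the large kernel's size and add:
    two parallel conv branches collapse into one equivalent kernel."""
    pad = (len(large) - len(small)) // 2
    return [a + b for a, b in
            zip(large, [0.0] * pad + small + [0.0] * pad)]

x = [0.5, -1.0, 2.0, 3.0, 0.0, 1.5]
k7 = [0.1, -0.2, 0.3, 0.5, 0.3, -0.2, 0.1]   # "large kernel" branch
k1 = [2.0]                                    # 1x1 branch
two_branch = [a + b for a, b in zip(conv1d(x, k7, 3), conv1d(x, k1, 0))]
merged = conv1d(x, merge_branches(k7, k1), 3)
```

Since convolution is linear in its kernel, the two-branch and merged computations agree elementwise, which is why the parallel branches add training-time capacity at no inference-time cost.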



Figure 1: Overview of our proposed convolution blocks designed to simulate the behavior of Swin Transformers. We leverage depthwise convolution and pointwise scaling to adapt a large receptive field and enrich the features by widening independent channels. We further compare different backbones of volumetric ConvNets and the Swin Transformer block architecture. The yellow dotted line demonstrates the differences in the spatial position of widening feature channels in the network bottleneck.

Similar to SwinUNETR Tang et al. (2022); Hatamizadeh et al. (2022a), the complete architecture of the encoder consists of 4 stages, each comprising 2 LK convolution blocks (i.e., L = 8).

Figure 3: Validation curves with Dice score for FeTA2021 (a), FLARE2021 (b) and AMOS2022 (c). 3D UX-Net demonstrates the fastest convergence rate in the limited-sample training (FeTA2021) and transfer learning (AMOS2022) scenarios respectively, while the convergence rate is comparable to SwinUNETR as the training sample size increases (FLARE2021).

Figure 4: Qualitative representations of tissue and multi-organ segmentation across the three public datasets. Boxed regions are further zoomed in to visualize the significant differences in segmentation quality. 3D UX-Net shows the best segmentation quality compared to the ground truth.

• Adapting Layer Normalization (LN): In ConvNets, batch normalization (BN) is a common strategy that normalizes convolved representations to enhance convergence and reduce overfitting. However, previous works have demonstrated that BN can have a detrimental effect on model generalizability Wu & Johnson (2021). Although several alternative normalization techniques have been proposed Salimans & Kingma (2016); Ulyanov et al. (2016); Wu & He (2018), BN still remains the optimal choice in volumetric vision tasks. Motivated by vision transformers and Liu et al. (2022), we directly substitute BN with LN in the encoder block, mirroring its use in ViTs Ba et al. (2016).
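A minimal sketch of the per-position LN computation (pure Python, omitting the learnable affine parameters that the real layer also carries): each spatial position's channel vector is normalized on its own, independent of the batch.

```python
import math

def layer_norm(channel_vector, eps=1e-5):
    """Normalize one spatial position's channels to zero mean / unit
    variance (per sample), unlike BN, which normalizes each channel
    across the whole batch."""
    n = len(channel_vector)
    mean = sum(channel_vector) / n
    var = sum((v - mean) ** 2 for v in channel_vector) / n
    return [(v - mean) / math.sqrt(var + eps) for v in channel_vector]

features = [0.5, 2.0, -1.0, 3.5]   # toy 4-channel feature vector
normed = layer_norm(features)
```

Because the statistics depend only on the sample itself, LN behaves identically at training and inference time, one reason it pairs well with small-batch 3D training.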

Comparison of transformer and ConvNet SOTA approaches on the FeTA2021 and FLARE2021 testing datasets. (*: p < 0.01, with Wilcoxon signed-rank test against all SOTA approaches)

Comparison of finetuning performance with transformer SOTA approaches on the AMOS 2022 testing dataset. (*: p < 0.01, with Wilcoxon signed-rank test against all SOTA approaches)

Ablation studies of different architectures on FeTA2021 and FLARE2021

Ablation studies of adapting the nn-UNet architecture on the FeTA2021 and FLARE2021 testing datasets. (*: p < 0.01, with Wilcoxon signed-rank test against all SOTA approaches; D.S: Deep Supervision)

The improvement may be mainly attributed to its innovative self-configuring training strategies and output ensembling as a postprocessing technique, while the network used in nn-UNet is only the plain 3D U-Net architecture. To further characterize the ability of our proposed network, we substitute the plain 3D U-Net architecture with our proposed 3D UX-Net and adapt the self-configured hyperparameters for training. We demonstrate a significant improvement of performance on the FeTA2021 and FLARE2021 datasets, with mean organ Dice improving from 0.874 to 0.881 and from 0.934 to 0.944 respectively, as shown in Table 6. To further investigate the difference in network architecture, we observed that the convolution blocks in nn-UNet leverage the combination of instance normalization and LeakyReLU. Such a design normalizes channel-wise features independently and mixes the channel context with small-kernel convolutional layers. In our design, we provide an alternative approach of extracting channel-wise features independently with depthwise convolution and mixing the channel information only during the downsampling layer. Therefore, layer normalization is leveraged in our scenario, and we further enhance the feature correspondence efficiently with a large receptive field across channels. Furthermore, we found that the deep supervision strategy in nn-UNet, which computes an auxiliary loss from each stage's intermediate output, also demonstrates its effectiveness in further improving performance (FeTA2021: from 0.876 to 0.881; FLARE2021: from 0.940 to 0.944).

Ablation studies of optimizing the 3D UX-Net architecture on the FeTA2021 and FLARE2021 testing datasets. (SD: Stage Depth, HDim: Hidden Dimension in the Bottleneck Layer)

Apart from the advantage in quantitative performance, we further leverage the LK depthwise convolutions to reduce the model parameters from 62.2M to 53.0M compared to SwinUNETR in Table 3. However, although the training efficiency of 3D UX-Net is already better than that of nn-UNet (FLOPs: 743.3G vs. 639.4G), we observed that the FLOPs of 3D UX-Net still remain high. Inspired by the architectures of both the Swin Transformer Liu et al. (2021) and ConvNeXt Liu et al. (2022) used in the natural image domain, we further remove the bottleneck layer (ResNet block with 768 channels) and increase the block depth of stage 3 (e.g., 8 blocks). Such an optimized design significantly reduces both the model parameters (from 53.0M to 32.1M; nn-UNet: 31.2M) and FLOPs (from 639.4G to 536.1G; nn-UNet: 743.3G), while preserving the performance (shown in Table 7).

ACKNOWLEDGMENTS

This research is supported by NIH Common Fund and National Institute of Diabetes, Digestive and Kidney Diseases U54DK120058 (Spraggins), NSF CAREER 1452485, NIH 2R01EB006136, NIH 1R01EB017230 (Landman), and NIH R01NS09529. This study was conducted in part using the resources of the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University, Nashville, TN. The identified datasets used for the analysis described were obtained from the Research Derivative (RD), a database of clinical and related data. The imaging dataset(s) used for the analysis described were obtained from ImageVU, a research repository of medical imaging data and image-related metadata. ImageVU and RD are supported by the VICTR CTSA award (ULTR000445 from NCATS/NIH) and Vanderbilt University Medical Center institutional funding. ImageVU pilot work was also funded by PCORI (contract CDRN-1306-04869). We further thank Quan Liu, a Ph.D. student in the Computer Science Department of Vanderbilt University, for extensive discussions of the initial idea of this paper.

A APPENDIX A.1 DATA PREPROCESSING & MODEL TRAINING

We apply hierarchical steps for data preprocessing: 1) intensity clipping is applied to further enhance the contrast of soft tissue (FLARE2021 & AMOS2022: {min: −175, max: 250}); 2) intensity normalization is performed after clipping for each volume using min-max normalization: (X − X_1)/(X_99 − X_1) to normalize the intensity values between 0 and 1, where X_p denotes the p-th percentile of intensity in X. We then randomly crop sub-volumes of size 96 × 96 × 96 from the foreground and perform data augmentations, including rotations, intensity shifting, and scaling (scaling factor: 0.1). All training processes with 3D UX-Net are optimized with the AdamW optimizer. We trained all models for 40000 steps using a learning rate of 0.0001 on an NVIDIA Quadro RTX 5000 for both FeTA2021 and FLARE2021, while training for AMOS2022 was performed on an NVIDIA RTX A6000. One epoch takes approximately 1 minute for FeTA2021, 10 minutes for FLARE2021, and 7 minutes for AMOS2022, respectively. We further summarize all the training parameters in Table 4.
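The clipping and percentile min-max steps above can be sketched as follows (pure Python on a flattened volume; the nearest-rank percentile is a simplification of what library transforms, e.g. in MONAI, compute):

```python
def clip_and_normalize(volume, clip_min=-175.0, clip_max=250.0):
    """Clip CT intensities to the soft-tissue window, then min-max
    normalize with the 1st/99th percentiles: (X - X_1) / (X_99 - X_1),
    clamping the result to [0, 1]."""
    clipped = [min(max(v, clip_min), clip_max) for v in volume]
    ordered = sorted(clipped)

    def percentile(p):  # nearest-rank percentile (simplified)
        idx = round(p / 100.0 * (len(ordered) - 1))
        return ordered[idx]

    x1, x99 = percentile(1), percentile(99)
    return [min(1.0, max(0.0, (v - x1) / (x99 - x1))) for v in clipped]

volume = [float(v) for v in range(-300, 400, 7)]  # toy CT intensities
normalized = clip_and_normalize(volume)
```

Clipping before computing the percentiles keeps extreme air/bone values from stretching the normalization range over the soft-tissue intensities of interest.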

