3D UX-NET: A LARGE KERNEL VOLUMETRIC CONVNET MODERNIZING HIERARCHICAL TRANSFORMER FOR MEDICAL IMAGE SEGMENTATION

Abstract

Recent 3D medical ViTs (e.g., SwinUNETR) achieve state-of-the-art performance on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation to 3D medical datasets. The effectiveness of these hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. We hypothesize that volumetric ConvNets can simulate the large-receptive-field behavior of these learning approaches with fewer model parameters using depth-wise convolution. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel (LK) sizes (e.g., starting from 7 × 7 × 7) to enable larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performance with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g., SwinUNETR) on three challenging public datasets for volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR, improving Dice from 0.929 to 0.938 (FLARE2021) and from 0.867 to 0.874 (FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net on AMOS2022 and demonstrate another improvement of 2.27% Dice (from 0.880 to 0.900).



Such performance gains are largely owing to the large receptive field from 3D shifted-window multi-head self-attention (MSA). However, such a large receptive field is computationally unscalable to achieve with traditional 3D volumetric ConvNet architectures. As the advancement of ViTs starts to bring back the concepts of convolution, the key components behind such large performance differences are attributed to the scaling behavior and to global self-attention with large receptive fields. As such, we further ask: can we leverage convolution modules to enable the capabilities of hierarchical transformers? The recent advance in LK-based depthwise convolution design (e.g., Liu et al. (2022)) provides a computationally scalable mechanism for large receptive fields in 2D ConvNets. Inspired by such a design, this study revisits 3D volumetric ConvNet design to investigate the feasibility of (1) achieving SOTA performance with a pure ConvNet architecture, (2) yielding much lower network complexity than 3D ViTs, and (3) providing a new direction for designing 3D ConvNets for high-resolution volumetric tasks. Unlike SwinUNETR, we propose a lightweight volumetric ConvNet, 3D UX-Net, that adapts the intrinsic properties of Swin Transformer with ConvNet modules and enhances volumetric segmentation performance with a smaller model capacity. Specifically, we introduce volumetric depth-wise convolutions with LK sizes to simulate the large-receptive-field operation that generates self-attention in Swin Transformer. Furthermore, instead of linearly scaling the self-attention features across channels, we introduce pointwise depth convolution scaling to distribute each channel-wise feature independently into a wider hidden dimension (e.g., 4× the input channels), thus minimizing the redundancy of learned context across channels and preserving model performance without increasing model capacity.
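The two substitutions described above, an LK depth-wise convolution in place of windowed self-attention and pointwise depth convolutions in place of the Swin MLP, can be sketched as a single encoder block. The sketch below is a minimal PyTorch illustration under our reading of the text; the name `UXBlock` and the channel counts are hypothetical, not the authors' released code.

```python
import torch
import torch.nn as nn

class UXBlock(nn.Module):
    """Minimal sketch of one LK ConvNet block (illustrative, not the authors' code)."""
    def __init__(self, dim, kernel_size=7, expansion=4):
        super().__init__()
        # LK depth-wise conv (groups=dim): one 7x7x7 filter per channel,
        # simulating the wide receptive field of windowed self-attention.
        self.dwconv = nn.Conv3d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # Pointwise depth convolutions (1x1x1, groups=dim): each channel is
        # expanded independently into a 4x wider hidden dimension, replacing
        # the Swin Transformer MLP.
        self.pwconv1 = nn.Conv3d(dim, expansion * dim, 1, groups=dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv3d(expansion * dim, dim, 1, groups=dim)

    def forward(self, x):  # x: (N, C, D, H, W)
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects channels-last, so permute around it.
        x = self.norm(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return residual + x

y = UXBlock(dim=48)(torch.randn(1, 48, 16, 16, 16))
print(tuple(y.shape))  # (1, 48, 16, 16, 16)
```

Note that `groups=dim` in the 1 × 1 × 1 convolutions is what makes the scaling "depth-wise": each input channel is widened on its own, rather than mixed with all others as in a linear MLP layer.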
We evaluate 3D UX-Net on supervised volumetric segmentation tasks with three public volumetric datasets: 1) MICCAI Challenge 2021 FeTA (infant brain imaging), 2) MICCAI Challenge 2021 FLARE (abdominal imaging), and 3) MICCAI Challenge 2022 AMOS (abdominal imaging). Surprisingly, 3D UX-Net, a network constructed purely from ConvNet modules, demonstrates a consistent improvement across all datasets compared with the current transformer SOTA. We summarize our contributions as follows:
• We propose 3D UX-Net to adapt transformer behavior purely with ConvNet modules in a volumetric setting. To the best of our knowledge, this is the first LK block design leveraging 3D depthwise convolutions to compete favorably with transformer SOTAs on volumetric segmentation tasks.
• We leverage depth-wise convolution with LK sizes as the generic feature extraction backbone, and introduce pointwise depth convolution to scale the extracted representations effectively with fewer parameters.
• We use three challenging public datasets to evaluate 3D UX-Net in 1) direct training and 2) fine-tuning scenarios on volumetric multi-organ/tissue segmentation. 3D UX-Net achieves consistent improvement in both scenarios over all ConvNet and transformer SOTAs with fewer model parameters.
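The parameter-efficiency claim above follows from simple arithmetic: a dense 3D convolution needs C_in × C_out × k³ weights, while a depth-wise convolution needs only one k³ filter per channel, a factor-of-C saving. The numbers below are illustrative (C = 48 channels, k = 7), not taken from the paper's configuration.

```python
C, k = 48, 7  # illustrative channel count and LK kernel size

dense     = C * C * k**3  # standard 3D conv: one k^3 filter per (in, out) channel pair
depthwise = C * k**3      # depth-wise 3D conv: one k^3 filter per channel

print(dense, depthwise, dense // depthwise)  # 790272 16464 48
```

This factor-of-C saving is what makes a 7 × 7 × 7 kernel affordable in 3D, where dense kernels of that size would dominate the parameter budget.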



Significant progress has been made recently with the introduction of vision transformers (ViTs) Dosovitskiy et al. (2020) into 3D medical downstream tasks, especially volumetric segmentation benchmarks Wang et al. (2021); Hatamizadeh et al. (2022b); Zhou et al. (2021); Xie et al. (2021); Chen et al. (2021). The characteristics of ViTs are the lack of image-specific inductive bias and the scaling behavior, both of which are enhanced by large model capacities and dataset sizes and contribute to their significant improvement over ConvNets on medical image segmentation Tang et al. (2022); Bao et al. (2021); He et al. (2022); Atito et al. (2021). However, it is challenging to adopt 3D ViT models as generic network backbones due to the high complexity of computing global self-attention with respect to the input size, especially for high-resolution images with dense features across scales. Therefore, hierarchical transformers were proposed to bridge these gaps with their intrinsic hybrid structure Zhang et al. (2022); Liu et al. (2021). Swin Transformer introduces the "sliding window" strategy into ViTs and thus behaves similarly to ConvNets Liu et al. (2021). SwinUNETR adapts Swin Transformer blocks as the generic vision encoder backbone and achieves the current state-of-the-art performance on several 3D segmentation benchmarks Hatamizadeh et al. (2022a); Tang et al. (2022).

TRANSFORMER-BASED SEGMENTATION

Significant efforts have been put into integrating ViTs for dense predictions in the medical imaging domain Hatamizadeh et al. (2022b); Chen et al. (2021); Zhou et al. (2021); Wang et al. (2021). With the advancement of Swin Transformer, SwinUNETR equips its encoder with Swin Transformer blocks to compute self-attention for enhancing brain tumor segmentation accuracy in 3D MRI images Hatamizadeh et al. (2022a). Tang et al. extend SwinUNETR by adding a self-supervised pre-training strategy for fine-tuning segmentation tasks. Another UNet-like architecture, Swin-Unet, further adapts Swin Transformer in both the encoder and decoder networks via skip-connections to learn local and global semantic features for multi-organ abdominal CT segmentation Cao et al. (2021). Similarly, SwinBTS shares a similar intrinsic structure with Swin-Unet, with an enhanced transformer module for detailed feature extraction Jiang et al. (2022). However, the

