SWINZS3: ZERO-SHOT SEMANTIC SEGMENTATION WITH A SWIN TRANSFORMER

Abstract

Zero-shot semantic segmentation (ZS3) aims to classify never-seen classes with zero training samples. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited attention ability constrains existing network architectures when reasoning over word embeddings. In light of the recent successes of Swin Transformers, we propose SwinZS3, a new framework that exploits visual embeddings and semantic embeddings in a joint embedding space. SwinZS3 combines a transformer image encoder with a language encoder. The image encoder is trained with pixel-text score maps using the dense language-guided semantic prototypes computed by the language encoder, which allows SwinZS3 to recognize unseen classes at test time without retraining. We evaluate our method on standard ZS3 benchmarks (PASCAL VOC and PASCAL Context), where state-of-the-art results demonstrate its effectiveness.

1. INTRODUCTION

Semantic segmentation is the foundation of several high-level computer vision applications such as autonomous driving and medical imaging. Deep learning has recently achieved great success in semantic segmentation Chen et al. (2018); Long et al. (2015); Ronneberger et al. (2015); Zhao et al. (2017b). However, fully-supervised semantic segmentation models usually require extensive collections of images with pixel-level annotations, and they can only handle pre-defined classes. Considering the high cost of collecting dense labels, weakly supervised semantic segmentation (WSSS) methods have recently been explored. WSSS methods are often based on easily obtained annotations, such as scribbles Sun et al., bounding boxes Dai et al. (2015), and image-level labels Hou et al. (2018). Among them, a popular trend builds on network visualization techniques, such as class activation maps that generate pseudo ground truths Zeiler & Fergus (2014); Zhang et al. (2020). However, these methods still require labeled images. In contrast, humans can recognize novel classes given only descriptions. Inspired by this, some recent methods pursue zero-shot semantic segmentation (ZS3) Zhao et al. (2017a); Bucher et al. (2019); Gu et al. (2020); Li et al. (2020). ZS3 benefits from semantic-level supervision from text by exploiting the relationships between pixels and the associated text, giving it a cheaper source of training data.

ZS3 methods can be categorized into generative and discriminative methods Baek et al. (2021). Both can predict unseen classes using only language-guided semantic information about those classes. In generative ZS3 methods Creswell et al. (2018); Kingma & Welling (2013), a segmentation network is first trained on labeled data of seen classes only. The feature extractor is then frozen to extract visual features of seen classes, and a semantic generator network is trained to translate language embeddings into the visual space, so that it can generate visual features conditioned on language embedding vectors. Finally, a classifier is trained on real features produced by the feature extractor for seen classes and on generated features produced by the semantic generator from language embeddings for unseen classes; this three-stage pipeline is sketched below. Although generative methods achieve impressive performance on zero-shot semantic segmentation tasks, they are limited by the multi-stage training strategy, and the visual features extracted by the feature extractor do not consider language information during training. This causes a seen-bias problem in the visual and generated features.
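As a concrete illustration, the following is a minimal PyTorch-style sketch of the three-stage generative pipeline described above. The module structure, dimensions, and losses are illustrative assumptions for exposition, not the exact design of any cited method.

```python
# Minimal sketch of the three-stage generative ZS3 pipeline (illustrative
# assumptions; not the exact architecture or losses of any cited work).
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, TEXT_DIM, N_SEEN, N_ALL = 256, 300, 15, 20  # assumed sizes

# Stage 1: train a segmentation network (feature extractor + seen-class head)
# with pixel-level labels of the seen classes only.
extractor = nn.Sequential(nn.Conv2d(3, FEAT_DIM, 3, padding=1), nn.ReLU())
seen_head = nn.Conv2d(FEAT_DIM, N_SEEN, 1)

# Stage 2: freeze the extractor and train a semantic generator that maps a
# class's language embedding to a plausible visual feature of that class.
for p in extractor.parameters():
    p.requires_grad = False
generator = nn.Sequential(nn.Linear(TEXT_DIM, FEAT_DIM), nn.ReLU(),
                          nn.Linear(FEAT_DIM, FEAT_DIM))

def generator_loss(word_emb, real_seen_feats):
    # Match generated features to real seen-class features; a plain L2 loss
    # here, though GAN/GMMN objectives are common in the literature.
    return ((generator(word_emb) - real_seen_feats) ** 2).mean()

# Stage 3: train a classifier over ALL classes on real features (seen) and
# generated features (unseen), so unseen classes become predictable.
full_head = nn.Linear(FEAT_DIM, N_ALL)

def classifier_loss(feats, labels):
    return F.cross_entropy(full_head(feats), labels)
```

Note that the classifier in stage 3 never sees a real unseen-class pixel; it relies entirely on the generator's text-conditioned features, which is where the seen-bias problem described above originates.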




Figure 1: Effect of the transformer's global reasoning and of the score-map decision boundary for zero-shot semantic segmentation. The motorbike (blue) is the unseen class. Existing solutions such as DeepLabv3+ often yield inaccurate segmentation results due to their limited receptive field and attention ability, losing fine-grained details. Using a transformer extractor significantly improves prediction accuracy on unseen classes, but a seen-bias problem remains, in which unseen-class pixels are classified into seen classes. Our SwinZS3 therefore proposes a language-guided score map to reduce it.

To overcome these limitations, we introduce a discriminative approach for ZS3 that trains visual and language encoders in a joint embedding space, avoiding the multi-stage training strategy. We alleviate the seen-class bias by minimizing the Euclidean distances, and by using the pixel-text score maps, between the semantic prototypes produced by the language encoder and the visual features of the corresponding classes. At test time, we use the learned discriminative language prototypes and combine the pixel-text score map with the Euclidean distance as the decision boundary, so no retraining is needed; a plausible formalization of this decision rule is sketched below.

Improving the network backbone of zero-shot networks is another effective way to reduce the bias problem. As shown in Figure 1, we argue that a shared shortcoming of previous ZS3 models is the reduced receptive field of CNNs and their limited use of attention mechanisms for extracting global relations among visual features conditioned on language semantics. The local nature of convolutions leads CNNs to extract visual features that miss long-range relationships within an image, and CNN-based frameworks sometimes fail to extract language-guided activation fields because they lack a globally perceiving attention mechanism. Recently, transformers Vaswani et al. (2017) have achieved significant breakthroughs in both natural language processing (NLP) and computer vision (CV) Xie et al. (2021a); Zheng et al. (2021); Arnab et al. (2021). ViT Dosovitskiy et al. (2020) is the first work to apply the transformer architecture to image classification, and the Swin Transformer Liu et al. (2021) presents an architecture for more general-purpose vision tasks, especially dense prediction. We argue that the self-attention mechanism benefits zero-shot semantic segmentation under semantic-level supervision: a transformer-based model can capture global feature relations and the semantic information in visual features through multi-head self-attention (MHSA). This paper takes this missing step and explores the Swin Transformer for ZS3, combining convolutional layers and transformer blocks to model global information guided by pixel-text distances and score maps. We also improve the decision boundary by modifying the Nearest Neighbor (NN) classifier with a Euclidean distance weighted by the score map. We demonstrate the effectiveness of our approach on standard zero-shot semantic segmentation benchmarks, achieving state-of-the-art performance on PASCAL-VOC Everingham et al. (2010) and PASCAL-Context Mottaghi et al. (2014). Some methods based on CLIP Radford et al. (2021); Xu et al. (2022); Ding et al. (2022); Arnab et al. (2021); Xu et al. (2021) often claim to be zero-shot learning methods; however, they usually use images and text labels of all classes during training, which causes supervision leakage.
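To make the score-weighted decision boundary concrete, one plausible formalization reads as follows; the notation ($f_i$, $t_c$, $s_{i,c}$, and the softmax weighting $\sigma$) is introduced here for exposition and is an assumption, not necessarily the exact objective of SwinZS3. Let $f_i$ denote the visual embedding of pixel $i$ from the image encoder and $t_c$ the language-guided semantic prototype of class $c$ from the language encoder. The pixel-text score and a score-weighted NN decision rule could then be written as

$$ s_{i,c} = f_i^{\top} t_c, \qquad \hat{y}_i = \arg\min_{c} \frac{\lVert f_i - t_c \rVert_2}{\sigma(s_i)_c}, $$

where $\sigma(s_i)_c$ is the $c$-th component of a softmax over the class scores of pixel $i$. A matching training objective on seen classes would jointly minimize the Euclidean distance to the ground-truth prototype and a cross-entropy over the score map,

$$ \mathcal{L} = \frac{1}{N} \sum_{i} \Big( \lVert f_i - t_{y_i} \rVert_2^{2} - \log \sigma(s_i)_{y_i} \Big), $$

with $y_i$ the ground-truth class of pixel $i$. Because the prototypes $t_c$ come purely from text, prototypes of unseen classes can be inserted into the same decision rule at test time without any retraining.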

