SWINZS3: ZERO-SHOT SEMANTIC SEGMENTATION WITH A SWIN TRANSFORMER

Abstract

Zero-shot semantic segmentation (ZS3) aims at learning to classify never-seen classes with zero training samples. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited attention ability constrains existing network architectures when reasoning over word embeddings. In light of the recent successes achieved by Swin Transformers, we propose SwinZS3, a new framework that exploits visual embeddings and semantic embeddings in a joint embedding space. SwinZS3 combines a transformer image encoder with a language encoder. The image encoder is trained with pixel-text score maps computed against the dense language-guided semantic prototypes produced by the language encoder, which allows SwinZS3 to recognize unseen classes at test time without retraining. We evaluate our method on standard ZS3 benchmarks (PASCAL VOC and PASCAL Context), where it achieves state-of-the-art performance, demonstrating its effectiveness.
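The dense pixel-text scoring mentioned above can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the function name and tensor shapes are assumptions, and the encoders are abstracted away as pre-computed embeddings:

```python
import numpy as np

def pixel_text_score_map(pixel_emb, text_emb):
    """Dense pixel-text score maps: cosine similarity between every pixel
    embedding and every class text prototype in the joint embedding space.

    pixel_emb: (H, W, D) features from the image encoder (assumed shape)
    text_emb:  (C, D) class prototypes from the language encoder
    returns:   (H, W, C) score maps, one per class
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->hwc", p, t)
```

At test time, prototypes for unseen classes are simply appended to `text_emb`; the predicted segmentation is the per-pixel argmax over the class axis, so no retraining is required.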

1. INTRODUCTION

Semantic segmentation is at the foundation of several high-level computer vision applications such as autonomous driving and medical imaging. Recent deep learning methods have achieved great success in semantic segmentation Chen et al. (2018); Long et al. (2015); Ronneberger et al. (2015); Zhao et al. (2017b). However, fully-supervised semantic segmentation models usually require extensive collections of images with pixel-level annotations, and they can only handle pre-defined classes. Considering the high cost of collecting dense labels, weakly supervised semantic segmentation (WSSS) methods have recently been explored. WSSS methods are typically based on easily obtained annotations, such as scribbles Sun et al., bounding boxes Dai et al. (2015), and image-level labels Hou et al. (2018). Among them, a popular trend builds on network visualization techniques such as class activation maps to generate pseudo ground-truths Zeiler & Fergus (2014); Zhang et al. (2020). However, these methods still require labeled images for the classes of interest. In contrast, humans can recognize novel classes from descriptions alone. Inspired by this, some recent methods pursue zero-shot semantic segmentation (ZS3) Zhao et al. (2017a); Bucher et al. (2019); Gu et al. (2020); Li et al. (2020).

ZS3 benefits from semantic-level supervision from text by exploiting the semantic relationships between pixels and the associated texts, which gives it a cheaper source of training data. ZS3 methods can be categorized into generative and discriminative methods Baek et al. (2021). Both can predict unseen classes using only language-guided semantic information about those classes. In generative ZS3 methods Creswell et al. (2018); Kingma & Welling (2013), a segmentation network is first trained on labeled data from seen classes only. The feature extractor is then frozen to extract visual features of the seen classes, and a semantic generator network is trained to translate language embeddings into the visual space, so that it can generate visual features conditioned on language embedding vectors. Finally, a classifier is trained on the features produced by the feature extractor for seen classes together with the features generated by the semantic generator from language embeddings for unseen classes. Although generative methods achieve impressive performance in zero-shot semantic segmentation, they are limited by a multi-stage training strategy, and the visual features extracted by the feature extractor do not take language information into account during training. This causes a bias toward seen classes in both the extracted and the generated features.
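The multi-stage generative ZS3 pipeline described above can be sketched in a few lines of numpy. This is a simplified illustration, not any paper's implementation: a least-squares linear map stands in for the trained generator network, the feature extractor is assumed to be already trained and frozen, and all names are hypothetical:

```python
import numpy as np

def fit_linear_generator(word_emb_seen, vis_feat_seen):
    """Stage 2 (simplified): learn a generator G mapping word embeddings to
    visual features of seen classes. A least-squares linear map stands in
    for the generator network used in practice."""
    W, _, _, _ = np.linalg.lstsq(word_emb_seen, vis_feat_seen, rcond=None)
    return W  # (d_word, d_vis)

def generate_unseen_features(W, word_emb_unseen, n_per_class=64,
                             noise=0.1, seed=0):
    """Stage 3 input: synthesize visual features for unseen classes by
    perturbing the generator output, so a classifier can be trained on
    real seen-class features plus these generated unseen-class features."""
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for c, w in enumerate(word_emb_unseen):
        mu = w @ W  # generator output for class c
        feats.append(mu + noise * rng.normal(size=(n_per_class, mu.shape[0])))
        labels.append(np.full(n_per_class, c))
    return np.concatenate(feats), np.concatenate(labels)
```

The sketch also makes the seen-bias problem visible: because `W` is fit only on seen-class pairs and the visual features were extracted without any language signal, the generated unseen-class features inherit whatever the seen classes dictate.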

