SWINZS3: ZERO-SHOT SEMANTIC SEGMENTATION WITH A SWIN TRANSFORMER

Abstract

Zero-shot semantic segmentation (ZS3) aims to classify never-seen classes with zero training samples. Convolutional neural networks (CNNs) have recently achieved great success in this task, but their limited attention ability constrains existing architectures when reasoning over word embeddings. In light of the recent successes of Swin Transformers, we propose SwinZS3, a new framework that exploits visual and semantic embeddings in a joint embedding space. SwinZS3 combines a transformer image encoder with a language encoder. The image encoder is trained with pixel-text score maps computed from the dense language-guided semantic prototypes produced by the language encoder, which allows SwinZS3 to recognize unseen classes at test time without retraining. We evaluate our method on the standard ZS3 benchmarks (PASCAL VOC and PASCAL Context), where state-of-the-art results demonstrate its effectiveness.

1. INTRODUCTION

Semantic segmentation underlies several high-level computer vision applications such as autonomous driving and medical imaging. Deep learning has recently achieved great success in semantic segmentation Chen et al. (2018); Long et al. (2015); Ronneberger et al. (2015); Zhao et al. (2017b). However, fully-supervised semantic segmentation models usually require extensive collections of images with pixel-level annotations, and they can only handle pre-defined classes. Considering the high cost of collecting dense labels, weakly supervised semantic segmentation (WSSS) methods have recently been explored. WSSS methods are often based on easily obtained annotations such as scribbles Sun et al., bounding boxes Dai et al. (2015), and image-level labels Hou et al. (2018). Among them, a popular trend is to generate pseudo ground-truths with network visualization techniques such as class activation maps Zeiler & Fergus (2014); Zhang et al. (2020). However, these methods still require labeled images. In contrast, humans can recognize novel classes from descriptions alone. Inspired by this, recent methods pursue zero-shot semantic segmentation (ZS3) Zhao et al. (2017a); Bucher et al. (2019); Gu et al. (2020); Li et al. (2020). ZS3 benefits from semantic-level supervision from text by exploiting the relationships between pixels and the associated words, and thus enjoys a cheaper source of training data. ZS3 methods can be categorized into generative and discriminative approaches Baek et al. (2021); both predict unseen classes using only the language-guided semantic information of the corresponding classes. In generative ZS3 methods Creswell et al. (2018); Kingma & Welling (2013), segmentation networks are first trained on labeled data from the seen classes only.
Then, the feature extractor is frozen to extract visual features of the seen classes, and a semantic generator network is trained to translate language embeddings into the visual space, so that it can generate visual features conditioned on language embedding vectors. Finally, a classifier is trained on the features produced by the feature extractor for seen classes and on the features generated from language embeddings for unseen classes. While generative methods achieve impressive performance on zero-shot semantic segmentation, they are limited by a multi-stage training strategy, and the visual features extracted by the feature extractor do not consider language information during training. This causes a seen-bias problem in the visual and generated features. To overcome this limitation, we introduce a discriminative approach for ZS3 that trains the visual and language encoders in a joint embedding space, avoiding the multi-stage training strategy. We alleviate the seen/unseen bias problem by minimizing the euclidean distances between the semantic prototypes produced by the language encoder and the visual features of the corresponding classes, and by using pixel-text score maps. At test time, we use the learned discriminative language prototypes and combine the pixel-text score map with the euclidean distance as the decision boundary, avoiding retraining. Improving the network backbone is another effective way to reduce the bias problem. As shown in Figure 1, we argue that a shared shortcoming of previous ZS3 models lies in the reduced receptive field of CNNs and their limited use of attention mechanisms for extracting the global relations of visual features conditioned on language semantic information.

2. RELATED WORK

Semantic segmentation: Semantic segmentation has made great advancements with the rise of deep learning. Most recent state-of-the-art models are based on fully convolutional networks Long et al. (2015) and assume that all training data have pixel-level annotations; DeepLab Chen et al. (2018) is a representative example.

Zero-shot semantic segmentation: ZS3 networks can be categorized into discriminative and generative methods. Among discriminative approaches, the work of Zhao et al. (2017a) focuses on hierarchically predicting unseen classes. SPNet Xian et al. (2019) exploits a semantic embedding space by mapping visual features to fixed semantic ones. JoEm Baek et al. (2021) proposes to align visual and semantic features in a joint embedding space. In contrast to discriminative methods, ZS3Net Bucher et al. (2019) synthesizes visual features with a generative moment matching network (GMMN); however, its training pipeline consists of three stages, which causes the bias problem. CSRL Li et al. (2020) exploits the relations of both seen and unseen classes and preserves them in the synthesized visual features. CaGNet Gu et al. (2020) uses a channel-wise attention mechanism in dilated convolutional layers for extracting visual features.

Visual-language learning: In recent years, learning from image-language pairs has grown rapidly, with representative works such as CLIP Radford et al. (2021) and ALIGN Jia et al. (2021), which are pretrained on hundreds of millions of image-language pairs. Yang et al. (2022) presents a unified contrastive learning method that can leverage both image-language pairs and image-label data. Our work further extends this line of research to the pixel level for ZS3.

3.1. MOTIVATIONS

Unlike supervised semantic segmentation methods, the unseen-class prototypes of discriminative zero-shot semantic segmentation rely on joint optimization of the visual encoder and the language encoder. To achieve good performance, this formulation therefore requires the network to perceive the language context structure. The current network of Baek et al. (2021) adopts traditional convolutional layers for aggregating language information; however, the intrinsic locality and weak attention of the convolution operator can hardly model long-range, accurate visual-language joint features. We therefore propose transformer-based blocks to address this limitation. Another limitation of ZS3 is the seen-bias problem: lacking labeled data for the unseen classes, it is difficult for the visual encoder to extract distinguishable features. As shown in Fig. 3, modulating the decision boundary can also reduce the bias problem. We analyze the shortcomings of traditional nearest-neighbor (NN) classifiers and use a new decision boundary to improve performance. Driven by these observations, we design our SwinZS3 framework, shown in Fig. 2.

3.2. OVERVIEW

Following common practice, we divide the classes into seen classes S and unseen classes U. At training time, we train our model, which includes a visual feature extractor and a semantic prototype encoder, on the seen classes S only. Zero-shot semantic segmentation requires the model to recognize both seen classes S and unseen classes U at test time. We use the visual extractor to obtain visual features and feed language embeddings (word2vec) into the language encoder to obtain semantic prototypes of the corresponding classes. The visual features are fed into a classifier supervised by ground-truth labels. The language prototypes are tied to the visual features with a regression loss, and their inter-relationships are transferred from language embeddings such as word2vec using a semantic-consistency loss. SwinZS3 then computes pixel-text score maps in a hyper-sphere space between the projected visual features and the semantic prototypes; these score maps are also supervised by the ground-truth labels. In the following, we describe our framework in detail.

3.3. TRANSFORMER BACKBONE

Our framework uses the Swin Transformer Liu et al. (2021) as backbone: an input image is split into non-overlapping patches, which are projected to h × w tokens. Each transformer block uses a multi-head self-attention (MHSA) layer to capture global feature information. In MHSA, the patch tokens are projected to queries Q ∈ R^{hw×d_k}, keys K ∈ R^{hw×d_k}, and values V ∈ R^{hw×d_v}, where h and w are the spatial dimensions of the feature maps and d_k, d_v denote the feature dimensions. Based on Q, K, and V, the output X is

X = softmax(QK^T / sqrt(d_k)) V    (1)

MHSA is the core operation of the transformer block, and the backbone's final output is produced by stacking multiple transformer blocks.
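The scaled dot-product attention of Eq. (1) can be sketched in a few lines of numpy. This is a minimal single-head illustration, not the full windowed Swin implementation; the function and weight names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over hw patch tokens (Eq. 1).

    tokens: (hw, d) patch embeddings; Wq, Wk, Wv: (d, d_k) projections.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (hw, hw) pairwise similarities
    return softmax(scores, axis=-1) @ V      # attention-weighted values
```

Because every token attends to every other token, the receptive field is global after a single block, in contrast to the local kernels of a CNN.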

3.4. NETWORK TRAINING

As shown in Fig. 2, our framework consists of four loss terms: a cross-entropy loss L_ce Misra & Maaten (2020), a pixel-wise regression loss L_r, a pixel-text score map loss L_aux, and a semantic consistency loss L_sc. The overall loss is

L = L_ce + L_r + λ_1 L_sc + λ_2 L_aux    (2)

where λ_1 and λ_2 balance the contributions of the different losses.

Cross-entropy loss: Let the final feature maps be υ ∈ R^{h×w×c}, where h, w, and c are the height, width, and number of channels. υ is fed into a classifier head f_c. In the zero-shot setting the classifier can only learn the seen classes, so we apply the cross-entropy loss Murphy (2012) that is widely adopted in supervised semantic segmentation to the seen classes S:

L_ce = - (1 / Σ_{c∈S} |N_c|) Σ_{c∈S} Σ_{p∈N_c} log( exp(w_c υ(p)) / Σ_{j∈S} exp(w_j υ(p)) )    (3)

where N_c is the set of locations labeled as class c in the ground truth.

Regression loss: Although L_ce trains the model toward a discriminative embedding space on the seen classes S, the model cannot classify the unseen classes U because the classifier head never learns the unseen-class prototypes. At test time, we want to use the language prototypes of both seen and unseen classes as the classifier to recognize the dense visual features extracted by the transformer backbone. For this, the distances between visual features and the corresponding language prototypes should be minimized in the embedding space, which we achieve with a regression loss L_r. As for L_ce, we first obtain the final visual feature maps υ ∈ R^{h×w×c}. We then build semantic feature maps s ∈ R^{h×w×d}, where each pixel s_c of s is a word (language) embedding of the same class as the corresponding visual-feature pixel. Given the language embedding maps, we feed them into the semantic encoder f_s:

µ = f_s(s)    (4)

where µ ∈ R^{h×w×c}, and each pixel µ_c of µ is a semantic prototype for class c.
The regression loss is then

L_r = (1 / Σ_{c∈S} |R_c|) Σ_{c∈S} Σ_{p∈R_c} d(υ(p), µ(p))    (5)

where d(·) is the euclidean distance metric and R_c is the region labeled with class c in the ground truth. L_r ensures that the dense visual features and semantic prototypes are projected into a joint embedding space in which pixels of corresponding classes lie close together. However, for ZS3 it shares a limitation with L_ce: L_r treats each pixel-wise visual feature and semantic prototype independently, without explicitly considering the relationships to other pixels. To address this, we propose to use a contrastive loss.

Pixel-text score map: In our framework, we use the score map to reduce the seen-bias problem in ZS3. As shown in Fig. 4, to obtain a discriminative joint embedding space, we compute the pixel-text score maps from the language prototypes µ_c ∈ R^{k×c} and the final feature maps υ ∈ R^{h×w×c}:

s = ῡ µ̄_c^T,  s ∈ R^{h×w×k}

where ῡ and µ̄_c are the l2-normalized versions of υ and µ_c along the channel dimension. Note that the score map must use the seen-class prototypes only; otherwise, the unseen-bias problem becomes more serious. The score maps characterize the matching between visual-feature pixels and language-guided semantic prototypes, which is one of the most crucial parts of our SwinZS3.
First, we use the score maps to compute an auxiliary segmentation loss:

L_aux = CrossEntropy(Softmax(s / τ), y)

where τ is a temperature coefficient, which we set to 0.07, and y is the ground-truth label. The auxiliary segmentation loss makes the joint embedding space more discriminative, which benefits zero-shot semantic segmentation.
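The regression loss (Eq. 5), the cosine score map, and the auxiliary loss above can be sketched together in numpy. This is an illustrative sketch under assumed tensor shapes, with hypothetical function names; it is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zs3_losses(visual, prototypes, labels, tau=0.07):
    """Sketch of the regression loss (Eq. 5) and the pixel-text score-map loss.

    visual:     (h, w, c) dense features from the transformer encoder.
    prototypes: (k, c) seen-class semantic prototypes from the language encoder.
    labels:     (h, w) integer ground-truth map over the k seen classes.
    """
    h, w, c = visual.shape
    flat_v = visual.reshape(-1, c)
    flat_y = labels.reshape(-1)

    # Regression loss: mean euclidean distance between each pixel's feature
    # and the prototype of its ground-truth class.
    l_r = np.mean(np.linalg.norm(flat_v - prototypes[flat_y], axis=-1))

    # Pixel-text score map: cosine similarity on the unit hyper-sphere.
    v_n = flat_v / np.linalg.norm(flat_v, axis=-1, keepdims=True)
    p_n = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    scores = v_n @ p_n.T                    # (h*w, k)

    # Auxiliary loss: temperature-scaled cross-entropy on the score map.
    probs = softmax(scores / tau, axis=-1)
    l_aux = -np.mean(np.log(probs[np.arange(len(flat_y)), flat_y] + 1e-12))
    return l_r, l_aux
```

Note how the score map uses only the seen-class prototypes, matching the constraint stated above.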

Semantic consistency loss:

The semantic consistency loss L_sc transfers the relational structure of word2vec to the semantic prototype embedding space. By adopting pre-trained word embedding features, L_sc preserves class contextual information. L_sc defines the relation between prototypes as

r^µ_ij = exp(-τ_µ d(µ_i, µ_j)) / Σ_{j∈S} exp(-τ_µ d(µ_i, µ_j))

where d(·) is the distance between two prototypes and τ_µ is a temperature parameter. The relation in the word embedding space is defined analogously:

r_ij = exp(-τ_s d(s_i, s_j)) / Σ_{j∈S} exp(-τ_s d(s_i, s_j))

The semantic consistency loss is then defined as

L_sc = - Σ_{i∈S} Σ_{j∈S} r_ij log( r^µ_ij / r_ij )

L_sc distills the word-embedding contextual information into the prototypes.
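The three formulas above amount to a row-wise KL divergence between two relation matrices, which can be sketched as follows. The function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def relation_matrix(embeds, tau):
    """Row-wise softmax over negative pairwise euclidean distances,
    i.e. r_ij = exp(-tau * d_ij) / sum_j exp(-tau * d_ij)."""
    d = np.linalg.norm(embeds[:, None, :] - embeds[None, :, :], axis=-1)
    e = np.exp(-tau * d)
    return e / e.sum(axis=1, keepdims=True)

def semantic_consistency_loss(prototypes, word_embeds, tau_mu=1.0, tau_s=1.0):
    """L_sc = -sum_ij r_ij * log(r^mu_ij / r_ij): a KL divergence that pulls
    the prototype relations r^mu toward the word-embedding relations r."""
    r_mu = relation_matrix(prototypes, tau_mu)
    r = relation_matrix(word_embeds, tau_s)
    return -np.sum(r * np.log(r_mu / r))
```

Because it is a KL divergence, the loss is non-negative and reaches zero exactly when the prototypes reproduce the word-embedding relations.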

3.5. NETWORK INFERENCE

During inference, we use the semantic prototypes output by the semantic encoder as a nearest-neighbor (NN) classifier Cover & Hart (1967). We compute the euclidean distances and score maps from each visual feature to the language prototypes and assign each visual feature to the nearest language prototype:

ŷ(p) = argmin_{c∈S∪U} d(υ(p), µ_c)(1 - sigmoid(s))

where d is the euclidean distance metric and s is the score map. Because the unseen classes U are still biased towards the seen classes S, we adapt the work of Baek et al. (2021), which proposed a decision boundary based on the Apollonius circle. Let d_1 and d_2 be the combined euclidean-and-score distances from an individual visual feature to its two nearest language prototypes µ_1 and µ_2, whose classes we denote c_1 and c_2. The decision rule is defined with the Apollonius circle as

ŷ(p) = c(p) if c_1 ∈ S and c_2 ∈ U, and c_1 otherwise,

where c(p) = c_1 1[d_1/d_2 ≤ γ] + c_2 1[d_1/d_2 > γ]. Here 1[·] is the indicator function, whose value is 1 if the argument is true and 0 otherwise, and γ is an adjustable parameter that modulates the decision boundary.
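The per-pixel decision rule above can be sketched as a small function. The combined distances are assumed to be precomputed; the function name and argument layout are our own.

```python
import numpy as np

def classify_pixel(dists, seen_mask, gamma=0.6):
    """Sketch of the Apollonius-style decision rule of Sec. 3.5.

    dists:     (K,) combined distances d(v(p), mu_c) * (1 - sigmoid(s_c))
               to every class prototype (smaller means closer).
    seen_mask: (K,) bool array, True for seen classes.
    """
    order = np.argsort(dists)
    c1, c2 = order[0], order[1]  # two nearest prototypes
    if seen_mask[c1] and not seen_mask[c2]:
        # Nearest is seen, runner-up unseen: favor the unseen class
        # unless the seen one is closer by a factor of gamma.
        return c1 if dists[c1] / dists[c2] <= gamma else c2
    return c1
```

With gamma below 1, the boundary is shifted toward the seen class, so borderline pixels are handed to the unseen runner-up, counteracting the seen bias.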

4.1. IMPLEMENTATION DETAILS

Training: For the transformer backbone, we use the Swin Transformer (Swin-Tiny) proposed in Liu et al. (2021) as a baseline for the transformer-based ZS3 task. To avoid supervision leakage from unseen classes Xian et al. (2018), the backbone parameters are initialized with the self-supervised model MoBY Xie et al. (2021b) pre-trained on ImageNet. We train SwinZS3 with an AdamW optimizer. For the backbone, we set the initial learning rate to 1 × 10^-4 and decay it with a polynomial schedule at every iteration; the learning rate of the other parameters is 10 times that of the backbone. The weight decay factor is set to 0.01. For data augmentation, we keep the same settings as Baek et al. (2021). For the remaining hyperparameters, we set λ_1 and λ_2 to 0.1 and γ to 0.6.

Figure 5: Qualitative results on PASCAL VOC. The unseen classes are "cow", "motorbike", and "cat". We compare the results of the other state-of-the-art methods and our SwinZS3.

Evaluation metrics: We use the mean intersection-over-union (mIoU) as the evaluation metric Long et al. (2015). In detail, we separately report the metric on the seen classes and on the unseen classes, denoted mIoU_s and mIoU_u. We also adopt the harmonic mean (hIoU) of mIoU_s and mIoU_u, since the arithmetic mean can be dominated by mIoU_s.
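The hIoU metric is just the harmonic mean of the two per-group scores; a minimal sketch:

```python
def harmonic_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen/unseen mIoU. Unlike the arithmetic mean,
    which a high mIoU_s can dominate, hIoU collapses to 0 whenever
    either group's score is 0."""
    if miou_seen + miou_unseen == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)
```

For example, a model with mIoU_s = 0.8 but mIoU_u = 0.0 scores an arithmetic mean of 0.4 yet an hIoU of 0, which is why hIoU is the preferred summary metric for zero-shot segmentation.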

4.2. ABLATION EXPERIMENT AND RESULTS

Ablation study: In Table 1, we present an ablation analysis of two aspects: (a) CNN vs. transformer backbones and (b) whether the score map (L_aux) is used to modulate the decision boundary. Because the cross-entropy and regression losses are crucial for recognizing the unseen classes, we choose DeepLabv3+ with L_ce, L_r, and L_sc as our baseline. The first row reports the baseline IoU scores without L_aux. Comparing the Swin Transformer (Swin-Tiny) backbone with DeepLabv3+, the transformer backbone gains 1.0 hIoU over the baseline. The second and third rows show that L_aux yields gains in mIoU_u of 3.0 and 3.1 over the baseline and of 1.9 and 2.1 over the SwinZS3 baseline. This is a significant improvement for ZS3 and demonstrates the effectiveness of both components. Finally, combining the transformer and the score maps gives the best mIoU scores.

Comparison to the state of the art: As shown in Table 2, we compare our approach with other state-of-the-art methods on PASCAL VOC and Context. We report the best scores for the different splits.



Figure 1: Effects of the transformer's global reasoning and the score-map decision boundary on zero-shot semantic segmentation. The motorbike (blue) is the unseen class. Existing solutions such as DeepLabv3+ often yield inaccurate segmentation results because of their limited receptive field and attention ability, losing fine-grained details. Using the transformer extractor significantly improves prediction accuracy on the unseen classes, but a seen-bias problem remains, which classifies unseen-class pixels into seen classes. Our SwinZS3 therefore proposes a language-guided score map to reduce it.

Figure 2: The overall framework of our approach, SwinZS3. SwinZS3 first extracts visual embeddings with a transformer-based feature extractor and K-class semantic prototypes with a language encoder. The prototypes are tied to the visual features with a regression loss, and their inter-relationships are transferred from the language embeddings (word2vec) using the semantic consistency loss. SwinZS3 then computes pixel-text score maps in a hyper-sphere space between the projected visual features and the semantic prototypes; the score maps are supervised by the ground-truth labels. The visual features are also fed into a classifier supervised by the ground-truth labels with a cross-entropy loss.

Figure 3: Comparison of the decision boundary with and without the score map. We visualize visual features as circles and semantic prototypes as triangles. Because of the bias problem in zero-shot learning, the visual features of the seen classes are tightly clustered, while the semantic prototypes and visual features of the unseen classes are biased. (a) We show a situation where the euclidean distance d1 is smaller than d2, so the unseen-class pixels would be classified into seen classes. However, the score-map distance a1 is bigger than a2, which inspires us to use the score-map distance to modify the euclidean distance. After this adjustment, we obtain view (b). This is crucial for improving performance in ZS3.

Figure 4: An illustrative comparison of the similarity matrices of the different losses. N is the number of pixel samples from the image features extracted by the visual encoder, and K is the number of classes. (a) The cross-entropy loss can be seen as pixel-label learning that assigns pixels to ground-truth labels; the relationship between pixels and labels is many-to-one. (b) The pixel-text score-map loss focuses on the relationship between the semantic prototypes and the visual features: one semantic prototype is assigned to many pixels. (c) The semantic-consistency loss preserves the structure of the language embedding (e.g., word2vec).

We perform experiments on PASCAL VOC and PASCAL Context. The PASCAL-VOC2012 dataset contains 1464 training images and 1449 validation images with 21 categories (20 object categories and background). The PASCAL Context dataset contains 4998 training and 5105 validation samples with 59 object categories and a single background class. Following common practice, we adopt the 10582 augmented training samples for PASCAL VOC. For zero-shot semantic segmentation, we divide the PASCAL-VOC2012 training samples into N seen and 20-N unseen classes. For example, when cow and motorbike are the unseen categories, we filter out the samples with cow or motorbike labels and train the segmentation network on the remaining samples; during training, the mIoU of the unseen classes must therefore remain 0. We follow the experimental settings of ZS3Net, dividing the 20 object classes of the PASCAL-VOC2012 training samples into four splits: (1) 18-2 classes (cow, motorbike), (2) 16-4 classes (cat, sofa), (3) 14-6 classes (boat, fence), and (4) 12-8 classes (bird, tvmonitor). Each split incrementally includes the previous splits' unseen classes, and the model is evaluated on the full 1449 validation images.
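The filtering step described above, where every training image containing an unseen class is discarded, can be sketched as follows. The sample representation (an id paired with a set of class names) is an assumption for illustration.

```python
def filter_seen_only(samples, unseen):
    """Keep only training images whose label set contains no unseen class,
    following the ZS3Net protocol described above.

    samples: list of (image_id, set_of_class_names) pairs (assumed layout).
    unseen:  iterable of unseen class names.
    """
    unseen = set(unseen)
    return [(img, labels) for img, labels in samples
            if not (labels & unseen)]
```

For the 18-2 split, for instance, any image labeled with cow or motorbike is dropped entirely, so the network never receives any supervision, even background context, from those images.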

Table 1: Ablation study on the unseen-6 split of PASCAL Context, comparing mIoU scores with different loss terms.

Table 2: Quantitative results on the PASCAL VOC and Context validation sets. Bold numbers indicate the best performance.

