SUPPRESSING THE HETEROGENEITY: A STRONG FEATURE EXTRACTOR FOR FEW-SHOT SEGMENTATION

Abstract

This paper tackles the Few-shot Semantic Segmentation (FSS) task with a focus on learning the feature extractor. The feature extractor has been largely overlooked by recent state-of-the-art methods, which directly use a deep model pretrained on ImageNet for feature extraction (without further fine-tuning). Against this background, we argue that the FSS feature extractor deserves exploration and identify the heterogeneity (i.e., the intra-class diversity in the raw images) as a critical challenge hindering intra-class feature compactness. The heterogeneity has three levels from coarse to fine: 1) Sample-level: the inevitable distribution gap between the support and query images makes them heterogeneous from each other. 2) Region-level: the background in FSS actually contains multiple regions with different semantics. 3) Patch-level: some neighboring patches belonging to the same class may appear quite different from each other. Motivated by these observations, we propose a feature extractor with Multi-level Heterogeneity Suppressing (MuHS). MuHS leverages the attention mechanism of the transformer backbone to suppress all three levels of heterogeneity. Concretely, MuHS reinforces the attention / interaction between different samples (query and support), different regions and neighboring patches by constructing cross-sample attention, cross-region interaction and a novel masked image segmentation (inspired by the recent masked image modeling), respectively. We empirically show that 1) MuHS brings consistent improvement for various FSS heads and 2) using a simple linear classification head, MuHS sets a new state of the art on multiple FSS datasets, validating the importance of FSS feature learning.

1. INTRODUCTION

Few-shot semantic segmentation (FSS) aims to generalize the semantic segmentation model from base classes to novel classes, using very few support samples. FSS has the potential to reduce the notoriously expensive pixel-wise annotation and has thus drawn great research interest. However, we observe that current research has been biased towards only one component of the FSS framework. Concretely, an FSS framework typically consists of a feature extractor and a matching head, while the recent state-of-the-art methods (Zhang et al. (2019); Tian et al. (2020b); Li et al. (2021a); Xie et al. (2021b); Wu et al. (2021); Zhang et al. (2021a); Li et al. (2020)) all focus on the matching head. They put no effort into learning the feature extractor and adopt an ImageNet-pretrained model without any fine-tuning.

Against this background, we argue that the FSS feature extractor deserves exploration and rethink the corresponding challenge. Some prior literature (Tian et al. (2020b); Zhang et al. (2021b)) argues that the challenge arises mainly because the limited support samples are insufficient for fine-tuning a large feature extractor (e.g., ResNet-50 (He et al. (2016))), therefore leading to overfitting. We hold a different perspective and identify the heterogeneity (i.e., the intra-class diversity in the raw images) as a critical challenge hindering the intra-class compactness of FSS features. Although the heterogeneity is not unique to FSS (e.g., it also exists in canonical segmentation), its challenge is significantly amplified by the few-shot setting. In our viewpoint, the heterogeneity has three levels from coarse to fine:

• Sample-level heterogeneity exists between the query and support images due to their distribution gap. For example, in Fig. 1 (a), the foreground objects ("cow" and "cattle") in the support and query images look quite different, although they both belong to the same semantic class "cow".
• Region-level heterogeneity exists (mostly) in the background, which actually contains multiple regions with different semantics. In Fig. 1 (b), "horse" in the image is a foreground region when the support object is another horse. However, when the support object shifts to a "rider", the horse in the image should be merged into the background, resulting in region-level heterogeneity.
• Patch-level heterogeneity exists among neighboring patches which belong to the same semantic class but have significant appearance variations. For example, in Fig. 1 (c), the upper and lower body of a single person are in different colors, therefore introducing patch-level heterogeneity.

Motivated by these observations, we propose an FSS feature extractor with Multi-level Heterogeneity Suppressing (MuHS). MuHS adopts the transformer backbone and leverages the attention mechanism to suppress all three levels of heterogeneity. Our choice of the transformer backbone is natural: the attention mechanism provides strong potential for constructing long-range dependencies across samples, regions and patches. Concretely, MuHS reinforces the attention / interaction between different samples (query and support), different regions and neighboring patches by constructing cross-sample attention, cross-region interaction and a novel masked image segmentation, respectively. To be more specific, these attention / interaction designs are as below:

(i) Cross-Sample Attention.
In popular transformers, the attention operates within each single sample and does not cross multiple samples. In contrast, MuHS constructs cross-sample attention with a novel design of "linking" tokens. In each transformer layer, we use some linking tokens to connect all the patch tokens from the query and support samples simultaneously, therefore efficiently propagating information across different samples.

(ii) Cross-Region Interaction. In popular transformers, the attention usually encourages feature interaction (absorption) between similar patch tokens. Contrary to this common practice, MuHS enforces additional feature absorption between patch tokens from dissimilar regions in the background. Such a cross-region interaction smooths the background and suppresses the region-level heterogeneity.

(iii) Masked Image Segmentation. Inspired by the recent masked image modeling (MIM), MuHS randomly masks some patch tokens and makes a partial prediction for the remaining patches. Afterwards, MuHS fills in trainable mask tokens and encourages the deep model to make the holistic prediction for the complete set of patches, yielding a novel masked image segmentation. The learned capacity of inferring the semantics of the masked patches from neighboring patches suppresses the patch-level heterogeneity.

In MuHS, the above three components each mitigate a corresponding type of heterogeneity and achieve complementary benefits for few-shot semantic segmentation. Empirically, we show that using MuHS to replace the frozen feature extractor (pretrained on ImageNet) brings consistent improvement for multiple popular FSS heads. Importantly, since the MuHS feature has relatively good intra-class compactness, we simply combine it with a linear classification head and achieve a new state of the art on multiple FSS datasets. For example, on PASCAL-5^i, MuHS achieves 69.1% mIoU under the 1-shot setting.
Our main contributions are summarized as follows: First, we shift the FSS research focus from the matching head to the feature extractor and reveal the heterogeneity as an important challenge. Second, we propose Multi-level Heterogeneity Suppressing (MuHS). MuHS utilizes novel cross-sample attention, cross-region interaction and masked image segmentation to suppress the heterogeneity from three levels.

Different from the recent progress on matching heads, this paper focuses on learning the feature extractor. The proposed MuHS feature extractor brings general improvement for a battery of matching heads and achieves state-of-the-art accuracy with a simple linear classification head.

Under the FSS scenario, we observe three-level heterogeneity (i.e., sample-level, region-level, patch-level), which hinders the intra-class compactness of FSS features. We think the attention mechanism in the transformer provides strong potential for constructing long-range dependencies across samples, regions and patches. Therefore, the proposed MuHS adopts the transformer network as its backbone and utilizes the characteristics of the transformer to suppress all three levels of heterogeneity in a unified framework.

Masked Image Modeling. Masked modeling methods (Devlin et al. (2018); Radford et al. (2018; 2019)) are widely utilized in NLP tasks. BERT (Devlin et al. (2018)) utilizes a "masked language model" (MLM) that randomly masks input tokens and predicts the original vocabulary id of the masked tokens. Motivated by BERT, BEiT (Bao et al. (2021)) proposes "masked image modeling" (MIM) to perform self-supervised learning on vision tasks. It randomly masks some proportion of image patches and replaces them with mask tokens. Recently, SimMIM (Xie et al. (2022)) and MAE (He et al. (2022)) simplify the MIM designs and improve transformers. An important component of MuHS is inspired by masked image modeling and is used to suppress the patch-level heterogeneity. Different from the popular self-supervised scheme, our Masked Image Segmentation (MIS) learning is fully supervised: it masks out some query patches at the input and yet maintains the holistic prediction for segmentation.

3.1. OVERVIEW

In the few-shot segmentation (FSS) task, we have a training set and a testing set with no label overlap. The testing set consists of a support set $S$ and a query set $Q$, which are from the same novel category (unseen in the training set). To set up an $N$-shot scenario, we randomly sample $N$ support images $\{I_s^1, \dots, I_s^N\}$ with corresponding masks $\{Y_s^1, \dots, Y_s^N\}$ from $S$ to recognize the same-semantic region in the query image $I_q \in Q$. During training, we follow the popular meta-learning scheme (Tian et al. (2020a); Pambala et al. (2021)) and construct a meta-task, i.e., sampling $N$ support images and one query image from the training set into each episode.

The proposed MuHS feature extractor is based on the transformer backbone consisting of $L$ transformer blocks, as illustrated in Fig. 2 (a). Given an input image, we split it into non-overlapping image patches, linearly embed them into patch tokens, and feed the patch tokens into the MuHS feature extractor. We use $X_q$ and $X_s$ to denote the embeddings of the query tokens and support tokens, respectively. Specifically for the query image, we randomly discard some patch tokens. Based on the transformer backbone, MuHS has three major components as below:

1) MuHS enforces a Cross-Sample Attention between the support and query through some linking tokens. These linking tokens are trainable and update themselves by absorbing all query and support tokens simultaneously. In the following transformer block, the updated state of the linking tokens is absorbed by the query and support, respectively. Therefore, it facilitates interaction between the query and support with relatively low computational cost. The details are in Section 3.2.

2) Within each transformer block, MuHS appends an additional Cross-Region Interaction upon the original self-attention layer. While the original self-attention encourages interaction among similar patch tokens, the Cross-Region Interaction promotes interaction among dissimilar regions in the background. The details are in Section 3.3.

3) Based on the output of the MuHS feature extractor, we use a matching head (e.g., the linear classification head) to make a partial prediction for the remaining patches of the query image (note that some query patches are discarded at the input layer). Afterwards, we add trainable mask tokens to the incomplete query patch tokens, and input them into a Masked Image Segmentation model consisting of $M$ transformer decoder blocks. The Masked Image Segmentation aims to make a holistic prediction for the input query, regardless of the discarded patches. The details are in Section 3.4.
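The query-side discarding and refilling described in the overview can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the shuffle-then-drop-the-rear scheme mentioned in the appendix, uses a zero vector as a stand-in for the trainable mask token, and omits positional embeddings. The function names `random_discard` and `refill_with_mask_tokens` are hypothetical.

```python
import numpy as np

def random_discard(tokens, mask_ratio, rng):
    """Shuffle the patch indices and drop the rear part of the shuffled order."""
    n = tokens.shape[0]
    n_keep = n - int(round(n * mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of the patches that survive
    drop_idx = np.sort(perm[n_keep:])   # indices of the discarded patches
    return tokens[keep_idx], keep_idx, drop_idx

def refill_with_mask_tokens(kept, keep_idx, n_total, mask_token):
    """Rebuild the full-length sequence, placing mask_token at discarded slots."""
    full = np.tile(mask_token, (n_total, 1))
    full[keep_idx] = kept
    return full

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 query patch tokens, 8-dim (toy sizes)
mask_token = np.zeros(8)            # assumption: stands in for the trainable mask token

kept, keep_idx, drop_idx = random_discard(tokens, mask_ratio=0.25, rng=rng)
full = refill_with_mask_tokens(kept, keep_idx, tokens.shape[0], mask_token)
```

Here `kept` plays the role of the incomplete query tokens fed to the feature extractor, and `full` plays the role of the completed sequence passed to the Masked Image Segmentation model.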

3.2. CROSS-SAMPLE ATTENTION

The Cross-Sample Attention mechanism constructs interaction between the support and the query so as to suppress the sample-level heterogeneity, as illustrated in Fig. 2 (c). Specifically, we use linking tokens $X_{link} \in \mathbb{R}^{C \times D}$ ($C$ embedding vectors with $D$ dimensions) to connect the query and the support. At the input block, we initialize two linking tokens $X^{0}_{link}$ with the mean features of the foreground and background regions of the support sample. Then the linking tokens are updated block by block through cross-attention, which is formulated as:

$$X^{i+1}_{link} = \mathrm{Att}_{crs}\big(\mathrm{Que}(X^{i}_{link}),\ \mathrm{Key}(\{X^{i}_{s}, X^{i}_{q}, X^{i}_{link}\}),\ \mathrm{Val}(\{X^{i}_{s}, X^{i}_{q}, X^{i}_{link}\})\big), \tag{1}$$

where $X^{i+1}_{link}$ denotes the updated linking tokens and $\{\cdot,\cdot\}$ denotes the concatenation operation. $\mathrm{Que}$, $\mathrm{Key}$ and $\mathrm{Val}$ are the operations that calculate the query, key and value embeddings of the support tokens $X^{i}_{s}$, query tokens $X^{i}_{q}$ and linking tokens $X^{i}_{link}$ in the $i$-th MuHS block, respectively. Given the updated linking tokens, we then update the support and the query patch tokens by:

$$X^{i+1}_{s} = \mathrm{Att}\big(\mathrm{Que}(X^{i}_{s}),\ \mathrm{Key}(\{X^{i}_{s}, X^{i+1}_{link}\}),\ \mathrm{Val}(\{X^{i}_{s}, X^{i+1}_{link}\})\big),$$
$$X^{i+1}_{q} = \mathrm{Att}\big(\mathrm{Que}(X^{i}_{q}),\ \mathrm{Key}(\{X^{i}_{q}, X^{i+1}_{link}\}),\ \mathrm{Val}(\{X^{i}_{q}, X^{i+1}_{link}\})\big), \tag{2}$$

where $X^{i}_{s}$ and $X^{i}_{q}$ are the $D$-dimensional embeddings of the support and query tokens in the $i$-th block, and $X^{i+1}_{link}$ denotes the embedding of the linking tokens updated by Eq. 1.

Since the linking tokens already absorb information from all the query and support tokens (Eq. 1), in the subsequent Eq. 2 they propagate the absorbed information onto the support and the query tokens, therefore facilitating an indirect interaction between the support and query samples. Compared with directly constructing patch-to-patch attention between the query and support samples, our solution with linking tokens has the advantage of high efficiency. Specifically, the patch-to-patch attention incurs quadratic complexity with respect to the number of patch tokens, while the cross-sample interaction through linking tokens only incurs linear complexity.
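The two-step update of Eq. 1 and Eq. 2 can be sketched with single-head attention in NumPy. This is an illustrative sketch under several assumptions that are not stated in the text: one attention head, projection matrices shared between Eq. 1 and Eq. 2, and no residual connections, normalization or MLP. The names `init_linking_tokens` and `cross_sample_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: each row of q attends over the rows of k/v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def init_linking_tokens(x_s, fg_mask):
    """Two linking tokens: mean foreground and mean background support feature."""
    return np.stack([x_s[fg_mask].mean(axis=0), x_s[~fg_mask].mean(axis=0)])

def cross_sample_block(x_s, x_q, x_link, w_q, w_k, w_v):
    # Eq. 1: linking tokens absorb support, query and linking tokens.
    all_tok = np.concatenate([x_s, x_q, x_link], axis=0)
    new_link = attention(x_link @ w_q, all_tok @ w_k, all_tok @ w_v)
    # Eq. 2: each sample attends to itself plus the updated linking tokens.
    ctx_s = np.concatenate([x_s, new_link], axis=0)
    ctx_q = np.concatenate([x_q, new_link], axis=0)
    new_s = attention(x_s @ w_q, ctx_s @ w_k, ctx_s @ w_v)
    new_q = attention(x_q @ w_q, ctx_q @ w_k, ctx_q @ w_v)
    return new_s, new_q, new_link

rng = np.random.default_rng(0)
dim = 4
x_s = rng.normal(size=(9, dim))    # 9 support patch tokens (toy size)
x_q = rng.normal(size=(9, dim))    # 9 query patch tokens
fg_mask = np.arange(9) < 4         # pretend the first 4 support patches are foreground
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

x_link = init_linking_tokens(x_s, fg_mask)
new_s, new_q, new_link = cross_sample_block(x_s, x_q, x_link, w_q, w_k, w_v)
```

In this sketch the query tokens never attend to support tokens directly; support information reaches them only through `new_link`, so the cross-sample part of the cost grows with the number of linking tokens times the number of patch tokens.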

3.3. CROSS-REGION INTERACTION

The Cross-Region Interaction (Fig. 2 (b)) is appended upon the self-attention layer in Eq. 2 within each MuHS block to encourage the interaction between different regions in the background. To this end, we use the ground-truth label to split the background in each query image (during training) into multiple regions. Specifically, some regions in the background actually belong to some annotated foreground classes but are merged into the background because the current training episode focuses on a different foreground class. We denote these regions as temporary background, and the remaining part (which has no foreground annotations) as constant background. Correspondingly, we use $x_{[tb]}$ and $x_{[cb]}$ to distinguish the tokens from the temporary and the constant background regions, respectively, and use $x_{[f]}$ to represent the tokens in the foreground-of-interest in the current episode. The Cross-Region Interaction computes the cosine distance between $x_{[tb]}$ and $x_{[cb]}$ after the self-attention layer (Eq. 2) and makes $x_{[cb]}$ absorb information from the dissimilar $x_{[tb]} \in X_{[tb]}$ by:

$$x'_{[cb]} = x_{[cb]} + \mathrm{softmax}\!\left(1 - \frac{x_{[cb]} \cdot X_{[tb]}}{|x_{[cb]}| \cdot |X_{[tb]}|}\right) \cdot X_{[tb]}, \tag{3}$$

where $x'_{[cb]}$ denotes the updated embedding of the constant background token.

Region-level triplet loss. Besides the above Cross-Region Interaction, which smooths the background across different regions through attention, we further use a region-level triplet loss to pull the constant background tokens and the temporary background tokens close to each other on the last MuHS transformer block. The region-level triplet loss is enforced on the final output state of the background tokens by:

$$L_{tri} = \max\big(D(x_{[tb]}, x_{[cb]}) - D(x_{[tb]}, x_{[f]}),\ 0\big), \tag{4}$$

where $D(\cdot, \cdot)$ is the cosine distance between two tokens.
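Eq. 3 and Eq. 4 can be illustrated with a small NumPy sketch. It assumes the softmax in Eq. 3 is taken over the temporary-background tokens for each constant-background token, and it evaluates Eq. 4 on a single triplet of tokens; how triplets are sampled or averaged is not specified here. The function names are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and rows of b (m, d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def cross_region_interaction(x_cb, x_tb):
    """Eq. 3: constant-background tokens absorb dissimilar temporary-background tokens."""
    dist = cosine_distance(x_cb, x_tb)                     # (n_cb, n_tb)
    w = np.exp(dist - dist.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)                   # softmax over tb tokens
    return x_cb + w @ x_tb

def region_triplet_loss(x_tb, x_cb, x_f):
    """Eq. 4 for one (temporary bg, constant bg, foreground) token triplet."""
    d_pos = cosine_distance(x_tb[None], x_cb[None])[0, 0]
    d_neg = cosine_distance(x_tb[None], x_f[None])[0, 0]
    return max(d_pos - d_neg, 0.0)

x_cb = np.array([[1.0, 0.0]])                # one constant-background token
x_tb = np.array([[1.0, 0.0], [0.0, 1.0]])    # two temporary-background tokens
updated = cross_region_interaction(x_cb, x_tb)
```

In the toy example the second temporary-background token is orthogonal to the constant-background token, so it receives the larger softmax weight, which is the opposite of what standard similarity-based attention would do.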

3.4. MASKED IMAGE SEGMENTATION

The Masked Image Segmentation (MIS) model is appended upon the feature extractor of MuHS. It makes two types of prediction, i.e., a partial prediction from a matching head and a holistic prediction from an additional "decoder + matching head". The details are as below:

• Partial prediction: We recall that MuHS randomly discards some patches of the input query image during training. Correspondingly, the output tokens $X^{L}_{q}$ from the $L$-layer MuHS feature extractor are incomplete. Given these remaining output patches, we use a matching head (linear classification head) to make the partial prediction in Fig. 2 (a). Specifically, we calculate the foreground / background mean features with the support tokens $X^{L}_{s}$ and the ground-truth object mask to correspondingly derive the foreground / background proxies $w_f$ and $w_b$. During training, the supervision on a specified query token $x^{L}_{q}$ is:

$$L_{partial}(x^{L}_{q}) = -\log \frac{y_q \exp(w_f^{T} x^{L}_{q}) + (1 - y_q) \exp(w_b^{T} x^{L}_{q})}{\exp(w_f^{T} x^{L}_{q}) + \exp(w_b^{T} x^{L}_{q})}, \tag{5}$$

where $y_q$ is the corresponding label of the query token $x^{L}_{q}$ ($y_q = 1$ if $x^{L}_{q}$ belongs to the foreground; otherwise $y_q = 0$).

• Holistic prediction: In addition, we fill in mask tokens with trainable positional embeddings to construct the full set of the query patches (as shown in Fig. 2 (a)). Therefore, the output tokens $X^{M}_{q}$ from the Masked Image Segmentation model are complete. We make the holistic prediction for all the query patches with an additional matching head, whose weight matrix is calculated from the foreground / background mean features of the support tokens $X^{M}_{s}$ from the Masked Image Segmentation model and the ground-truth object mask. The holistic prediction is supervised by a cross-entropy loss $L_{holistic}$, which is similar to Eq. 5 and is thus omitted here.

We note that the Masked Image Segmentation model is only utilized for training.
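The linear classification head and Eq. 5 can be sketched as a two-way softmax cross-entropy whose class weights are the support proxies. This is a minimal NumPy illustration; averaging the per-token loss over all given query tokens is an assumption, and the function names `proxies`, `partial_loss` and `predict` are hypothetical.

```python
import numpy as np

def proxies(x_s, fg_mask):
    """Foreground / background proxies: mean support features under the mask."""
    w_f = x_s[fg_mask].mean(axis=0)
    w_b = x_s[~fg_mask].mean(axis=0)
    return w_f, w_b

def partial_loss(x_q, y_q, w_f, w_b):
    """Eq. 5 averaged over query tokens (averaging is an assumption)."""
    logits = np.stack([x_q @ w_b, x_q @ w_f], axis=1)   # column 0: bg, column 1: fg
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y_q)), y_q].mean()

def predict(x_q, w_f, w_b):
    """Label a query token as foreground (1) if it scores higher on w_f than on w_b."""
    return (x_q @ w_f > x_q @ w_b).astype(int)

x_s = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])   # support tokens
fg_mask = np.array([True, True, False, False])                     # support mask per token
x_q = np.array([[1.0, 0.0], [0.0, 1.0]])                           # query tokens
y_q = np.array([1, 0])                                             # query labels

w_f, w_b = proxies(x_s, fg_mask)
loss = partial_loss(x_q, y_q, w_f, w_b)
pred = predict(x_q, w_f, w_b)
```

Because $y_q$ is binary, the numerator of Eq. 5 selects exactly one of the two exponentials, so the expression reduces to the standard cross-entropy used in `partial_loss`.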
During testing, we feed all the query patches into the MuHS feature extractor and compare each query token against these proxies to classify each patch into the foreground / background.

Overall Training. During the training stage, we successively apply the feature extractor and the Masked Image Segmentation model for the query prediction. The overall training loss is as follows:

$$L = L_{partial} + \alpha \cdot L_{tri} + \beta \cdot L_{holistic}. \tag{6}$$

We evaluate the proposed MuHS on two datasets: PASCAL-5^i (Shaban et al. (2017)) and COCO-20^i (Nguyen & Todorovic (2019)). PASCAL-5^i consists of PASCAL VOC 2012 (Everingham et al. (2010)) and additional annotations from SDS (Hariharan et al. (2014)). We divide the 20 classes into 4 splits, each with 5 classes. During evaluation on one split (5 classes), we use the other three splits (15 classes) for training. We randomly sample 1000 pairs of support and query for testing on each split. COCO-20^i is built from COCO2014 (Lin et al. (2014)). We divide the 80 classes into 4 splits, each with 20 classes. During evaluation on one split (20 classes), we use the other three splits (60 classes) for training. We randomly sample 20000 pairs of support and query for testing on each split.

Following previous works (Tian et al. (2020b); Zhang et al. (2021b)), we compare the performance on the testing splits using mean intersection over union (mIoU) as our evaluation metric.

Table 3: Ablation studies of our proposed method under 1-shot and 5-shot setting.

4.4. COMPATIBILITY TO STATE-OF-THE-ART MATCHING HEADS.

Since the proposed MuHS focuses on learning the feature extractor, we are interested in its compatibility with different FSS matching heads. Specifically, we investigate two popular FSS heads, i.e., PFENet (Tian et al. (2020b)) and CyCTR (Zhang et al. (2021b)). We compare our MuHS feature against two frozen CNN features (i.e., ResNet50 and ResNet101, following their original practice), as well as the frozen DeiT-B feature. The results are summarized in Fig. 3, from which we draw three observations as below:

First, compared with the frozen CNN features, the frozen DeiT-B feature considerably decreases the accuracy for both the PFENet and CyCTR heads. For example, the frozen DeiT-B feature is inferior to the frozen ResNet50 feature by 3.1% and 3.2% mIoU on PFENet and CyCTR, respectively. This is somewhat surprising given that transformer features usually achieve better discriminative ability than their CNN counterparts. However, we think the above observation is actually reasonable, because both the PFENet and CyCTR heads (as well as most matching heads in prior literature) are specifically designed for CNN features and lack consideration for transformer features.

Second, our MuHS features significantly outperform the CNN features, although the employed heads are specifically designed for CNN features. For example, using the PFENet head, the MuHS feature surpasses the ResNet101 feature by 7.70%. We infer that although there are some weaknesses in combining the transformer feature with these heads (as in observation 1), the benefits of suppressing the heterogeneity in MuHS dominate. Therefore, MuHS brings improvement over the CNN features and shows good compatibility with these matching heads.

Third, comparing the achieved mIoU of "MuHS + PFENet (CyCTR) head" against "MuHS + linear classification head" (in Table 1), we find that the latter is slightly better.
This result is consistent with the above two observations: MuHS is the superior feature for the PFENet and CyCTR heads (observation 2), while those two heads are not the best heads for MuHS and other transformer features (observation 1). Therefore, we currently recommend using the simple linear classification head for MuHS. That being said, we are optimistic that future matching heads in the FSS community will work well with MuHS.

MuHS comprises three components that suppress the heterogeneity in FSS. We investigate their benefits through ablation, as shown in Table 3. We draw two observations: First, each of the three modules brings considerable improvement on its own. For example, under 1-shot on PASCAL-5^i, adding Cross-Sample Attention, Cross-Region Interaction and Masked Image Segmentation alone improves the baseline by +1.9%, +2.4% and +3.0%, respectively. Second, MuHS integrating all three components achieves further improvement, e.g., +5.2% on PASCAL-5^i under the 1-shot setting. This indicates that the three components, which suppress different levels of heterogeneity, achieve complementary benefits. We provide more ablation studies in Sec A.5.

Visualization of heterogeneity suppression. We visualize some segmentation results in Fig. 4 to intuitively understand how the proposed MuHS suppresses the heterogeneity and improves FSS. In Fig. 4 (a), due to the sample-level heterogeneity (between cattle and cow), the baseline fails to recall many foreground pixels. In contrast, the proposed MuHS significantly improves the recall on the foreground pixels due to its capacity of suppressing the sample-level heterogeneity. In Fig. 4 (b), the region-level heterogeneity makes the baseline classify many pixels on the horse into the foreground (person). In contrast, MuHS smooths the background and thus removes the distraction from the horse. In Fig. 4 (c), MuHS successfully merges the upper part of the body into the foreground "person" by suppressing patch-level heterogeneity. Additionally, we visualize the feature distribution before and after eliminating heterogeneity in Sec A.4 and plot the convergence curves in Sec A.6.

5. CONCLUSION

This paper proposes a feature extractor with Multi-level Heterogeneity Suppressing (MuHS) for few-shot semantic segmentation (FSS). Based on the transformer backbone, MuHS sets up novel cross-sample attention, cross-region interaction and masked image segmentation. The cross-sample attention efficiently propagates information across different samples. The cross-region interaction facilitates feature absorption between dissimilar regions within the background. The masked image segmentation utilizes the contextual information to infer the labels for discarded (masked) patch tokens so as to reinforce the capacity of contextual inference. These three modules respectively suppress the heterogeneity from three different levels, therefore improving the intra-class compactness of the FSS features. Extensive experiments on two popular FSS datasets demonstrate the effectiveness of MuHS and the achieved results are on par with the state of the art.

A APPENDIX

In the appendix, we supply the details which are not described in the main text due to space limitation. In Section A.1, we analyze the impact of some hyper-parameters. In Section A.2, we add more implementation details. In Section A.3, we adopt one more dataset to evaluate performance. In Section A.4, we compare the feature distribution before and after eliminating heterogeneity. In Section A.5, we supply more ablation experiments. In Section A.6, we plot the convergence curves of the proposed MuHS and recent state-of-the-art methods.

A.1 HYPER-PARAMETERS ANALYSIS. We analyze the impact of important hyper-parameters, i.e., α and β in Eq. 6, and investigate the impact of the model depth and masked ratio of the proposed Masked Image Segmentation (MIS) in Section 3.4. The experiments are reported on split-0 of PASCAL-5^i, under the 1-shot setting.

In Fig. 5 (a), we evaluate the impact of α, which controls the weight of the triplet loss in Eq. 6. We observe that the accuracy first increases (when α increases from 0 to 0.6) and then decreases (when α further increases to 1.0). We set α = 0.6 as the weight factor.

In Fig. 5 (b), we evaluate the impact of β, which controls the weight of the holistic prediction loss in Eq. 6. We observe that the accuracy first increases (when β increases from 0 to 0.5) and then decreases (when β further increases to 1.0). We set β = 0.5 as the weight factor.

In Table 4, we analyse the impact of the Masked Image Segmentation (MIS) model depth. We observe that the 7-layer MIS model achieves the best accuracy. In Table 5, we analyse the impact of the masked ratio for the MIS model. We observe that randomly masking out 7% patches achieves the best accuracy.


A.2 IMPLEMENTATION DETAILS. We supply the implementation details of how to transform a patch token into a pixel-wise mask map and the scheme to generate the discarded patches.

A.5 ABLATIONS ON MORE VARIANTS. We recall that each of the three components (i.e., cross-sample attention, cross-region interaction and masked image segmentation) brings considerable improvement independently, and that integrating all of them achieves further improvement. To investigate this further, we supply more ablation studies by adding three more variants, as shown in Table 7. Each variant, which combines two of the three components, still improves the baseline.

In Fig. 7, we plot the convergence curves of CyCTR, PFENet and MuHS. Compared with CyCTR and PFENet, MuHS requires 4× fewer training epochs (50 epochs) and less time (8 hours), and achieves higher accuracy on PASCAL-5^i. The comparison is based on PyTorch and an NVIDIA A100 GPU.



Figure 1: The heterogeneity from three levels. (a) shows the sample-level heterogeneity between the support and the query. The "cow" in the support is adopted to segment "cattle" in the query, in spite of their different appearance. (b) shows the region-level heterogeneity in the background. When the foreground object is the "rider", the "horse" should share the same class (BG: background) with "grass". (c) shows the patch-level heterogeneity among neighboring patches. The colors of the upper and lower parts of the body are different.

Transformers for Visual Recognition. Recently, transformers have been introduced to computer vision tasks, e.g., image classification (Dosovitskiy et al. (2020); Vaswani et al. (2021)), segmentation (Wang et al. (2021); Xie et al. (2021a); Li et al. (2021b; 2022); Zhou et al. (2021)) and detection (Carion et al. (2020); Zhu et al. (2020); Bar et al. (2022)), and have shown promising performance.

Figure 2: Overview of the proposed MuHS. (a) Following the vision transformer backbone, MuHS splits the support and query images into patches, embeds each patch into a patch token and feeds all the patch tokens into the stacked transformer blocks. Based on the backbone, MuHS 1) inserts a Cross-Sample Attention module between the query and support features to suppress the sample-level heterogeneity (as in Section 3.2), 2) appends an extra Cross-Region Interaction module upon the original self-attention layer to suppress the region-level heterogeneity (as in Section 3.3), and 3) integrates a novel Masked Image Segmentation task to suppress the patch-level heterogeneity (as in Section 3.4). (b) The detailed structure of the transformer block with Cross-Region Interaction. (c) The detailed structure of the Cross-Sample Attention.

Figure 3: MuHS feature is compatible with multiple FSS matching heads.

Table 3:

| Components | PASCAL-5^i 1-shot | PASCAL-5^i 5-shot | COCO-20^i 1-shot | COCO-20^i 5-shot |
|---|---|---|---|---|
| Baseline | 63.9 | 74.8 | 43.1 | 53.4 |
| Baseline + Cross-Sample Attention | 65.8 | 75.3 | 45.8 | 54.7 |
| Baseline + Cross-Region Interaction | 66.3 | 75.5 | 46.1 | 54.8 |
| Baseline + Masked Image Segmentation | 66.9 | 75.8 | 46.3 | 55.1 |
| All (MuHS) | 69.1 | 76.7 | 47.4 | 56.7 |

Figure 4: Visualization of the segmentation results on PASCAL-5^i. (a) MuHS suppresses the sample-level heterogeneity between cattle (query) and cow (support), therefore improving the recall of the "cattle" pixels in the query. (b) MuHS suppresses the region-level heterogeneity and merges "horse" into the background of the query, when the support object is "person". (c) Due to its capacity of suppressing the patch-level heterogeneity, MuHS recognizes the upper part of the body, although there is no such clue (i.e., white upper body) from the support.

Figure 5: Analysis on the hyper-parameters.

Figure 7: The convergence curves of recent methods and MuHS on split-0 of PASCAL-5^i.

Third, we conduct extensive experiments to validate the effectiveness of the proposed MuHS. Experimental results confirm that MuHS is compatible to multiple FSS heads and achieves new state of the art using a simple linear classification head.

Table 5: Analysis on the masked ratio for MIS.

Table 7: Ablation studies on more variants of three components for PASCAL-5^i. Columns: Cross-Sample Attention, Cross-Region Interaction, MIS, PASCAL-5^i.

4.2. IMPLEMENTATION DETAILS

We adopt DeiT-B/16 (Touvron et al. (2021)) (pretrained on ImageNet (Deng et al. (2009))) as our backbone. We use the SGD optimizer and set the learning rate to 9e-4. We randomly crop images to 480 × 480 and follow the data augmentation in PFENet (Tian et al. (2020b)). For PASCAL-5^i, we train for 50 epochs with batch size 4. For COCO-20^i, we train for 30 epochs with batch size 16. The proposed MuHS is trained with PyTorch on 4 NVIDIA A100 GPUs. More details are in Sec A.2.

4.3. COMPARISON WITH THE STATE OF THE ARTS

We compare MuHS with the existing state of the art on PASCAL-5^i and COCO-20^i. The results on the two datasets are summarized in Table 1 and Table 2, respectively.

From Table 1, we clearly observe the superiority of MuHS on PASCAL-5^i. First, comparing MuHS against the DeiT-B baseline, we find MuHS achieves considerable improvements. For example, under the 1-shot and 5-shot settings, MuHS outperforms the DeiT-B baseline by +5.2% and +1.9% mIoU, respectively. Second, MuHS surpasses all the state-of-the-art methods by a clear margin (especially under the 5-shot setting). For example, MuHS clearly outperforms the strongest competing method BAM (Lang et al. (2022)) by +1.3% and +5.8% mIoU on 1-shot and 5-shot, respectively. We note that all the competing methods use sophisticated matching heads, while our MuHS only uses a simple linear classification head. We thus infer that the superiority of our MuHS mainly comes from the discriminative features, indicating that MuHS is a strong feature extractor for FSS.

The observation on COCO-20^i (Table 2) is similar, i.e., MuHS improves the DeiT-B baseline and presents superiority against all the competing methods. For example, under the 5-shot setting, MuHS surpasses the DeiT-B baseline by +3.3% mIoU and surpasses BAM (Lang et al. (2022)) by +5.5% mIoU.

ETHICS STATEMENT

This paper can help to improve the semantic segmentation accuracy with the limited labeled samples. It may be applied to automatic driving system to improve safety when the system needs to recognize unseen objects. We will explore more applications of few-shot segmentation. Moreover, we will try to improve the reliability of few-shot segmentation systems to reduce potential problems.

REPRODUCIBILITY STATEMENT

The MuHS is reproducible. In the main text, we describe the two datasets for evaluation (i.e., PASCAL-5^i and COCO-20^i) and the details about the experimental implementation. We supply the analysis of some hyper-parameters in the appendix.

• To transform a patch into a pixel-wise mask map, we follow the common practice (PANet (Wang et al. (2019)), PFENet (Tian et al. (2020b))) of spatial up-sampling. Specifically, we first obtain the softmax scores for each patch token through the classification head. Then, the score maps are upsampled through bilinear interpolation to match the size of the input image. Finally, we use the argmax operation to generate the pixel-wise mask map.
• To generate the discarded patches, we randomly shuffle the patch tokens and then mask the rear of the token sequence. This operation is the same as in other MIM methods (e.g., MAE (He et al. (2022))).

A.3 EVALUATION ON MORE DATASETS. We evaluate one more dataset, i.e., Cityscapes (Cordts et al. (2016)), an urban street-scene dataset. We use 15 classes (out of the commonly-used 19 classes) to construct the base set and use the other 4 classes (i.e., "sky", "person", "car", "bicycle") for the novel set.

| | Baseline | CyCTR | MuHS |
|---|---|---|---|
| Cityscapes | 13.1 | 15.2 | 25.2 |

Table 6: Comparison with the state of the art and the baseline on Cityscapes for the 1-shot setting.

Based on this newly-generated few-shot segmentation benchmark, we compare the proposed MuHS against the baseline and a recent state-of-the-art method, CyCTR (Zhang et al. (2021b)). From Table 6, we observe that MuHS significantly surpasses the baseline and CyCTR by +12.1% and +10% mAP, respectively. We note that Cityscapes is challenging for the few-shot segmentation task. We infer that this is because in Cityscapes the number of semantic classes appearing in a single image is much larger, therefore increasing the challenge from region-level heterogeneity.

A.4 COMPARISON OF THE FEATURE DISTRIBUTION. To better understand how MuHS suppresses the three types of heterogeneity, we use t-SNE visualization to compare the feature distribution before and after MuHS in Fig. 6. We additionally use green to denote the foreground features of the support to visualize the distribution gap between the support and the query.

We correspondingly draw two observations. First, the intra-class distributions of both the background and the foreground become more compact, indicating that MuHS effectively suppresses the region-level and patch-level heterogeneity. Second, the foreground features from the support and query images become closer, indicating that MuHS effectively suppresses the sample-level heterogeneity. These observations are consistent with the segmentation examples in Fig. 4. This experiment also intuitively validates that MuHS improves intra-class feature compactness.
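The patch-to-pixel conversion described in the implementation details (softmax scores per patch, bilinear up-sampling, then argmax) can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' code: the hand-written `bilinear_upsample` uses pixel-centre alignment with edge clamping, which is an assumption about the interpolation convention, and `patch_scores_to_mask` is a hypothetical name.

```python
import numpy as np

def bilinear_upsample(score, out_h, out_w):
    """Bilinear resize of an (h, w, c) score map, pixel-centre aligned, edges clamped."""
    h, w, _ = score.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = score[y0][:, x0] * (1 - wx) + score[y0][:, x1] * wx
    bot = score[y1][:, x0] * (1 - wx) + score[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def patch_scores_to_mask(scores, grid_hw, out_hw):
    """scores: (n_patches, n_classes) softmax scores in row-major patch order."""
    gh, gw = grid_hw
    score_map = scores.reshape(gh, gw, -1)
    up = bilinear_upsample(score_map, *out_hw)
    return up.argmax(axis=-1)       # pixel-wise class index

fg = np.array([0.9, 0.1, 0.1, 0.1])          # foreground score of a 2x2 patch grid
scores = np.stack([1.0 - fg, fg], axis=1)    # columns: background, foreground
mask = patch_scores_to_mask(scores, (2, 2), (4, 4))
```

In the toy example only the top-left patch is confidently foreground, so the up-sampled mask marks the top-left corner pixel as foreground and the opposite corner as background.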

