SUPPRESSING THE HETEROGENEITY: A STRONG FEATURE EXTRACTOR FOR FEW-SHOT SEGMENTATION

Abstract

This paper tackles the Few-shot Semantic Segmentation (FSS) task with a focus on learning the feature extractor. The feature extractor has somehow been overlooked by recent state-of-the-art methods, which directly use a deep model pretrained on ImageNet for feature extraction (without further fine-tuning). Against this background, we argue that the FSS feature extractor deserves exploration and observe that heterogeneity (i.e., the intra-class diversity in the raw images) is a critical challenge hindering intra-class feature compactness. The heterogeneity has three levels, from coarse to fine: 1) Sample-level: the inevitable distribution gap between the support and query images makes them heterogeneous to each other. 2) Region-level: the background in FSS actually contains multiple regions with different semantics. 3) Patch-level: some neighboring patches belonging to the same class may appear quite different from each other. Motivated by these observations, we propose a feature extractor with Multi-level Heterogeneity Suppressing (MuHS). MuHS leverages the attention mechanism in the transformer backbone to effectively suppress heterogeneity at all three levels. Concretely, MuHS reinforces the attention / interaction between different samples (query and support), different regions, and neighboring patches by constructing cross-sample attention, cross-region interaction, and a novel masked image segmentation (inspired by recent masked image modeling), respectively. We empirically show that 1) MuHS brings consistent improvement to various FSS heads and 2) with a simple linear classification head, MuHS sets a new state of the art on multiple FSS datasets, validating the importance of FSS feature learning.
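The cross-sample attention mentioned above can be illustrated with a minimal NumPy sketch: query and support patch tokens are concatenated into one sequence so that attention runs jointly across both samples, letting each query patch interact with support patches of the same class. This is a simplified, hypothetical illustration (identity projections instead of learned weights, single head), not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_sample_attention(query_tokens, support_tokens, d_k):
    """Joint attention over concatenated query and support patch tokens.

    Concatenating the two sequences lets every query patch attend to
    support patches (and vice versa), which is one way to suppress the
    sample-level heterogeneity described in the abstract.
    """
    tokens = np.concatenate([query_tokens, support_tokens], axis=0)  # (Nq+Ns, d)
    # Identity projections for brevity; a real transformer uses learned W_q/W_k/W_v.
    q, k, v = tokens, tokens, tokens
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (Nq+Ns, Nq+Ns), rows sum to 1
    out = attn @ v                           # tokens fused across both samples
    n_q = query_tokens.shape[0]
    return out[:n_q], out[n_q:]              # split back into per-sample tokens
```

Under this reading, the only change relative to standard per-image self-attention is the concatenation step: the attention matrix gains off-diagonal blocks that mix query and support tokens.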

1. INTRODUCTION

Few-shot semantic segmentation (FSS) aims to generalize a semantic segmentation model from base classes to novel classes using very few support samples. FSS holds the potential to reduce the notoriously expensive pixel-wise annotation and has thus drawn great research interest. However, we observe that current research has been biased towards only part of the FSS framework. Concretely, an FSS framework typically consists of a feature extractor and a matching head, while recent state-of-the-art methods (Zhang et al. (2019; 2020)) all focus on the matching head. They spend NO effort on learning the feature extractor and adopt an ImageNet-pretrained model without any fine-tuning. Against this background, we think the FSS feature extractor deserves exploration and rethink the corresponding challenge. Some prior works (Tian et al. (2020b); Zhang et al. (2021b)) argue that the challenge mainly arises because the limited support samples are insufficient for fine-tuning a large feature extractor (e.g., ResNet-50 (He et al. (2016))), leading to overfitting. We hold a different perspective and observe heterogeneity (i.e., the intra-class diversity in the raw images) as a critical challenge hindering the intra-class compactness of FSS features. Although heterogeneity is not unique to FSS (e.g., it does exist in the

