UNSUPERVISED SEMANTIC SEGMENTATION WITH SELF-SUPERVISED OBJECT-CENTRIC REPRESENTATIONS

Abstract

In this paper, we show that recent advances in self-supervised representation learning enable unsupervised object discovery and semantic segmentation with a performance that matches the state of supervised semantic segmentation 10 years ago. We propose a methodology based on unsupervised saliency masks and self-supervised feature clustering to kickstart object discovery, followed by training a semantic segmentation network on pseudo-labels to bootstrap the system on images with multiple objects. We show that, while conceptually simple, our proposed baseline is surprisingly strong. We present results on PASCAL VOC that go far beyond the current state of the art (50.0 mIoU), and we report for the first time results on MS COCO for the whole set of 81 classes: our method discovers 34 categories with more than 20% IoU, while obtaining an average IoU of 19.6 over all 81 categories.

Figure 1: Unsupervised semantic segmentation predictions on PASCAL VOC (Everingham et al., 2012). Our COMUS does not use human annotations to discover objects and their precise localization. In contrast to the prior state-of-the-art method MaskContrast (Van Gansbeke et al., 2021), COMUS yields more precise segmentations, avoids confusion of categories, and is not restricted to only one object category per image.

1. INTRODUCTION

The large advances in dense semantic labelling in recent years were built on large-scale human-annotated datasets (Everingham et al., 2012; Lin et al., 2014; Cordts et al., 2016). These supervised semantic segmentation methods (e.g., Ronneberger et al., 2015; Chen et al., 2018) require costly human annotations and operate only on a restricted set of predefined categories. Weakly-supervised segmentation (Pathak et al., 2015; Wei et al., 2018) and semi-supervised segmentation (Mittal et al., 2019; Zhu et al., 2020) reduce the annotation cost by requiring only class labels or labels for a subset of images. However, they are still bound to predefined labels. In this paper, we follow a recent trend to move away from externally defined class labels and instead identify object categories automatically by letting the patterns in the data speak. This can be achieved by (1) exploiting dataset biases to replace the missing annotation, (2) a way to kickstart the learning process on "good" samples, and (3) a bootstrapping process that iteratively expands the domain of exploitable samples. A recent method that exploits dataset biases, DINO (Caron et al., 2021), reported promising effects of self-supervised feature learning in conjunction with a vision transformer architecture by exploiting the object-centric bias of ImageNet with a multi-crop strategy. Their paper particularly emphasized the object-centric attention maps on some samples. We found that the attention maps of DINO are not strong enough on a sufficiently broad set of images to kickstart unsupervised semantic segmentation (see Fig. 4), but the learned features within an object region yield clusters of surprisingly high purity that align well with underlying object categories (see Fig. 3).
Thus, we leverage unsupervised saliency maps from DeepUSPS (Nguyen et al., 2019) and BASNet (Qin et al., 2019) to localize foreground objects and extract DINO features from these foreground regions. This already enables unsupervised semantic segmentation on images that show a dominant object category against an unspectacular background, as is common in PASCAL VOC (Everingham et al., 2012). However, on other datasets, such as MS COCO (Lin et al., 2014), most objects appear in context with other objects. Even on PASCAL VOC, many images contain multiple different object categories. To extend to such images, we propose training a regular semantic segmentation network on the obtained pseudo-masks and further refining this network by self-training on its own outputs. Our method, dubbed COMUS (Clustering Object Masks for learning Unsupervised Segmentation), segments objects also in multi-object images (see Figure 1) and allows us, for the first time, to report unsupervised semantic segmentation results on the full 80-category MS COCO dataset without any human annotations. While some hard object categories are not discovered by our procedure, we obtain good clusters for many COCO object categories. Our contributions can be summarized as follows:

1. We propose a strong and simple baseline method (summarized in Figure 2) for unsupervised discovery of object categories and unsupervised semantic segmentation in real-world multi-object image datasets.

2. We show that unsupervised segmentation can reach quality levels comparable to supervised segmentation 10 years ago (Everingham et al., 2012). This demonstrates that unsupervised segmentation is not merely an ill-defined academic playground.

3. We perform extensive ablation studies to analyze the importance of the individual components of our pipeline, as well as its bottlenecks, to identify promising directions for further improving unsupervised object discovery and unsupervised semantic segmentation.
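The kickstart step described above, clustering pooled self-supervised features of salient foreground regions into pseudo-categories, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a plain numpy k-means with deterministic farthest-point seeding stands in for the clustering used in COMUS, and the function names (`cluster_object_features`, `farthest_point_init`) and toy feature vectors are hypothetical.

```python
import numpy as np

def farthest_point_init(feats, k):
    """Deterministic seeding: start from the first point, then repeatedly
    pick the point farthest from all centroids chosen so far."""
    centroids = [feats[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(feats - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(feats[dists.argmax()])
    return np.stack(centroids)

def cluster_object_features(feats, k, iters=20):
    """Assign a pseudo-category to each object region via k-means over its
    pooled feature vector. `feats` has shape (N, D): one row per region."""
    # L2-normalize so Euclidean distance follows cosine geometry
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    centroids = farthest_point_init(f, k)
    for _ in range(iters):
        # Lloyd iteration: assign each region to its nearest centroid ...
        d = np.linalg.norm(f[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each centroid to the mean of its members
        for j in range(k):
            members = f[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels

# Toy demo: two well-separated synthetic "categories" of region features.
rng = np.random.default_rng(0)
cat_a = np.eye(8)[0] + rng.normal(scale=0.05, size=(20, 8))
cat_b = np.eye(8)[1] + rng.normal(scale=0.05, size=(20, 8))
labels = cluster_object_features(np.vstack([cat_a, cat_b]), k=2)
```

In the actual pipeline, each row of `feats` would be a DINO feature vector pooled inside an unsupervised saliency mask, and the resulting cluster indices would serve as pseudo-labels for training the segmentation network.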

2. RELATED WORK

There are several research directions that try to tackle the challenging task of detecting and segmenting objects without any, or with only a few, human annotations.

Unsupervised Semantic Segmentation. The first line of work (Van Gansbeke et al., 2021; He et al., 2022; Cho et al., 2021; Ji et al., 2019; Hwang et al., 2019; Ouali et al., 2020; Hamilton et al., 2022; Ke et al., 2022) aims to learn dense representations for each pixel in the image and then cluster them (or their aggregation from pixels in the foreground segments) to obtain per-pixel labels. While learning semantically meaningful dense representations is an important task in itself, clustering them directly to obtain semantic labels is very challenging (Ji et al., 2019; Ouali et al., 2020). Thus, usage of additional priors or inductive biases could simplify dense rep-

