UNSUPERVISED SEMANTIC SEGMENTATION WITH SELF-SUPERVISED OBJECT-CENTRIC REPRESENTATIONS

Abstract

In this paper, we show that recent advances in self-supervised representation learning enable unsupervised object discovery and semantic segmentation with a performance that matches the state of the field of supervised semantic segmentation 10 years ago. We propose a methodology based on unsupervised saliency masks and self-supervised feature clustering to kickstart object discovery, followed by training a semantic segmentation network on pseudo-labels to bootstrap the system on images with multiple objects. We show that, while conceptually simple, our proposed baseline is surprisingly strong. We present results on PASCAL VOC that go far beyond the current state of the art (50.0 mIoU), and we report for the first time results on MS COCO for the whole set of 81 classes: our method discovers 34 categories with more than 20% IoU, while obtaining an average IoU of 19.6 over all 81 categories.

Figure 1: Unsupervised semantic segmentation predictions on PASCAL VOC (Everingham et al., 2012). Our COMUS does not use human annotations to discover objects and their precise localization. In contrast to the prior state-of-the-art method MaskContrast (Van Gansbeke et al., 2021), COMUS yields more precise segmentations, avoids confusion of categories, and is not restricted to only one object category per image.

1. INTRODUCTION

The large advances in dense semantic labelling in recent years were built on large-scale human-annotated datasets (Everingham et al., 2012; Lin et al., 2014; Cordts et al., 2016). These supervised

* Work done during internship at Amazon.

