MULTI-INSTANCE INTERACTIVE SEGMENTATION WITH SELF-SUPERVISED TRANSFORMER

Abstract

The rise of Vision Transformers (ViT), combined with better self-supervised learning pre-tasks, has taken representation learning to the next level, beating supervised results on ImageNet. In particular, the self-attention mechanism of ViTs makes it easy to visualize the semantic information learned by the network. Following the release of DINO and its revealing attention maps, many works have tried to leverage its representations for unsupervised segmentation. Despite very promising results on basic images with a single clear object against a simple background, ViT representations are not yet able to segment images with several classes and object instances in an unsupervised fashion. In this paper, we propose SALT: Semi-supervised Segmentation with Self-supervised Attention Layers in Transformers, an interactive algorithm for multi-class/multi-instance segmentation. We follow the path of previous works and take it a step further by discriminating between different objects, using sparse human input to select them. We show that remarkable results are achieved with very sparse labels. Different pre-tasks are compared, and we show that self-supervised ones are more robust for panoptic segmentation while achieving very similar performance overall. Evaluation is carried out on Pascal VOC 2007 and COCO-panoptic. Performance is measured under extreme conditions, such as very noisy interactions, and sparsity going down to as little as one interaction per class.

1. INTRODUCTION

The last ten years have seen the rise of computer vision tasks such as localization and segmentation. As a result, technologies such as autonomous driving and robotics have met great success, at the cost of annotating sufficiently large datasets. Indeed, state-of-the-art approaches are all based on training a neural network in a supervised fashion (Strudel et al. 2021, Xie et al. 2021). Although this works well in areas with enough resources to label millions of images, there are others where labels are almost nonexistent but data is already available in large quantities. For instance, in fields such as astronomy, one is sometimes limited by the amount of available ground-truth labels (Pasquet et al., 2019). In others, such as medical imaging, data needs to be labeled by professionals, which is very expensive. Therefore, leveraging unlabeled data is a necessity in many computer vision tasks. Numerous attempts exist in the literature to address this problem, such as semi-supervised learning (Kipf & Welling, 2017), weakly supervised learning (Strudel et al., 2022), and active learning (Aghdam et al., 2019). These methods achieve some improvement, but still need slight supervision. More recently, self-supervised pre-tasks have leveraged the representation power of Vision Transformers (ViT: Dosovitskiy et al. 2021) in a fashion similar to what has been done in NLP. Indeed, an image can be seen as a sequence of p × p patches. Transformers have recently outperformed convolutional neural networks, and results with the self-supervised pre-task DINO (Caron et al., 2021) have shown impressive salient regions in the attention maps of the class token in the last ViT layer. This has led authors to test the unsupervised foreground-detection capabilities of such representations (Wang et al. 2022; Amir et al. 2021), devising clever ways to cluster these feature representations in order to split foreground from background. Melas-Kyriazi et al.
(2022) went one step further and did this for more than one foreground object. However, these applications are limited to simple images with a clear background and very few salient objects. We believe that the representation power of self-supervised ViTs can be pushed much further with very sparse human interactions. Here, we discriminate between different objects, using sparse human input to select them. Our goal is twofold: to assess the representation power of self-supervised ViTs, and at the same time to create an interactive segmentation algorithm that is not powered by a supervised learning algorithm, yet still harnesses information from a dataset. Indeed, if one has a huge unlabeled dataset, self-supervised learning can first be used to derive meaningful representations, and our algorithm can then help label images in just a few seconds. In this paper, we propose SALT: Semi-supervised Segmentation with Self-supervised Attention Layers in Transformers, a graph-based semi-supervised approach that harnesses the representation power of self-supervised ViTs to create segmentation masks from sparse human interactions. We test segmentation performance on Pascal VOC 2007 (Everingham et al., 2010) and a modified version of COCO-panoptic (Lin et al., 2015b) that contains only big things, hereafter COCO big things. However, because of the nature of interactive segmentation algorithms themselves, we do not compare to other algorithms: they usually take different forms of input and can also be iterative, and to the best of our knowledge there is no interactive segmentation method that shares the same input as ours. For each dataset, we create an interaction dataset with human inputs, and we craft elaborate evaluation protocols that show what can realistically be expected from our algorithm, as well as its limitations.
We study the performance of our algorithm under extreme conditions, which is interesting for unsupervised tasks where only very sparse information about the position of an object is available. To the best of our knowledge, this is the first unsupervised interactive segmentation algorithm able to handle many panoptic classes simultaneously while achieving pleasing results. Although its results are still far below those of supervised state-of-the-art algorithms, this work shows the potential of future ViTs for zero-shot interactive segmentation, and eventually unsupervised segmentation. The paper is organized as follows. In section 2, we review related work. In section 3, we explain our method. In section 4, we present the modifications made to COCO-panoptic and how we gathered interactions. In section 5, we compare different pre-trained ViTs and evaluate the robustness of our method to noise and to interaction sparsity down to one patch per class. Finally, in section 6, we present our conclusions.
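To give a concrete sense of what a graph-based semi-supervised labelling step looks like, here is a minimal label-propagation sketch over patch features. This is an illustrative toy, not SALT itself (which is described later in the paper): `propagate_labels`, the `alpha` damping factor, and the two-cluster demo data are all assumptions made for the example.

```python
import numpy as np

def propagate_labels(feats, seeds, alpha=0.9, iters=50):
    """Illustrative label propagation on a patch-feature affinity graph.

    feats: (N, d) patch features (e.g. from a frozen self-supervised ViT).
    seeds: (N,) int array; -1 for unlabelled patches, otherwise a class id
           coming from a sparse user interaction.
    Returns a per-patch class prediction.
    """
    n_cls = seeds.max() + 1
    # Cosine-style affinity, clipped and row-normalised into transitions.
    W = feats @ feats.T
    np.fill_diagonal(W, 0.0)
    W = np.clip(W, 0.0, None)
    P = W / W.sum(axis=1, keepdims=True)
    # One-hot seed matrix; unlabelled rows start at zero.
    Y = np.zeros((len(seeds), n_cls))
    Y[seeds >= 0, seeds[seeds >= 0]] = 1.0
    # Iteratively diffuse seed labels along the graph, re-injecting seeds.
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)

# Toy demo: two clusters of identical features, one labelled seed each.
feats = np.vstack([np.tile([1.0, 0.0], (5, 1)),
                   np.tile([0.0, 1.0], (5, 1))])
seeds = np.full(10, -1)
seeds[0], seeds[5] = 0, 1
pred = propagate_labels(feats, seeds)
```

The point of the sketch is the input contract: a handful of labelled patches per class is enough to diffuse labels over the whole feature graph, which is exactly the regime (down to one interaction per class) evaluated in section 5.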

2. RELATED WORK

Vision Transformer. The Transformer architecture (Vaswani et al., 2017) has become the default architecture for natural language processing (NLP) since it was first introduced five years ago. Only very recently did computer vision start transitioning from Convolutional Neural Networks (LeCun et al., 1998) to Transformers. Pioneering works tried to implement the self-attention mechanism within CNNs (Hu et al. 2019; Ramachandran et al. 2019; Zhao et al. 2020). Dosovitskiy et al. (2021) ultimately released the Vision Transformer, using 16×16 patches as tokens and almost the same encoder as the original Transformer. Since ViTs first appeared, many strategies have been developed to train them more effectively (Beyer et al. 2022, Touvron et al. 2021, 2022), as well as variants (Liu et al., 2021). Besides, many recent works have shown that ViTs trained in a self-supervised fashion (Caron et al. 2021, Bao et al. 2021, Assran et al. 2022) outperform their supervised counterparts. Self-supervised ViT attention maps have also shown high semantic comprehension. Self-supervised learning. In recent years, different clever pre-tasks have been developed to exploit unlabeled data and pre-train models in a self-supervised fashion. Pioneering works designed ingenious pretext tasks that exploit internal structures of the data, such as patch-ordering prediction (Noroozi & Favaro, 2017), recovering colors from grayscale images (Zhang et al., 2016), and image-rotation prediction (Gidaris et al., 2018). Nowadays, most approaches fall into one of two categories: generative or discriminative. Generative methods are usually based on masked image encoding (He et al. 2021, Bao et al. 2021), while contrastive learning typically uses siamese networks to discriminate between two views of an image (Chen et al. 2020, He et al. 2020, Grill et al. 2020, Caron et al. 2021, Assran et al. 2022). Unsupervised Segmentation.
Earlier methods mostly used color/background constraints (Cheng et al. 2014, Wei et al. 2012). Recently, methods that extract features from a self-supervised Vision Transformer (Dosovitskiy et al., 2021) trained with DINO (Caron et al., 2021) significantly improved

