MULTI-INSTANCE INTERACTIVE SEGMENTATION WITH SELF-SUPERVISED TRANSFORMER

Abstract

The rise of Vision Transformers (ViT), combined with better self-supervised pre-tasks, has taken representation learning to the next level, beating supervised results on ImageNet. In particular, the self-attention mechanism of ViT makes it easy to visualize the semantic information learned by the network. Following the release of DINO's attention maps, many works have tried to leverage its representations for unsupervised segmentation. Despite very promising results on basic images with a single clear object against a simple background, ViT representations are not yet able to segment images containing several classes and object instances in an unsupervised fashion. In this paper, we propose SALT: Semi-supervised Segmentation with Self-supervised Attention Layers in Transformers, an interactive algorithm for multi-class/multi-instance segmentation. We follow the path of previous works and take it a step further by discriminating between different objects, using sparse human input to select said objects. We show that remarkable results are achieved with very sparse labels. Different pre-tasks are compared, and we show that self-supervised ones are more robust for panoptic segmentation while achieving very similar overall performance. Evaluation is carried out on Pascal VOC 2007 and COCO-panoptic. Performance is evaluated under extreme conditions, such as very noisy and sparse interactions, down to as little as one interaction per class.

1. INTRODUCTION

The last ten years have seen the rise of computer vision tasks such as localization and segmentation. As a result, technologies such as autonomous driving and robotics have met great success, at the cost of annotating sufficiently large datasets. Indeed, state-of-the-art approaches are all based on training a neural network in a supervised fashion (Strudel et al. 2021; Xie et al. 2021). Although this might work well in areas where there are enough resources to label millions of images, there are others where data is already available in large quantities but almost no labels exist. For instance, in fields such as astronomy, one is sometimes limited by the amount of available ground-truth labels (Pasquet et al., 2019). In others, such as medical imaging, data needs to be labeled by professionals, which is very expensive. Therefore, leveraging unlabeled data is a necessity in many computer vision tasks. Numerous attempts exist in the literature to solve this problem, such as semi-supervised learning (Kipf & Welling, 2017), weakly supervised learning (Strudel et al., 2022), and active learning (Aghdam et al., 2019). These methods can achieve some improvement, but still need slight supervision. More recently, self-supervised pre-tasks have leveraged the representation power of Vision Transformers (ViT: Dosovitskiy et al. 2021), in a similar fashion to what has been done in NLP. Indeed, an image can be seen as a sequence of p × p patches. Transformers have recently outperformed convolutional neural networks, and results with the self-supervised pre-task DINO (Caron et al., 2021) have shown impressive salient regions in the attention maps from the class token in the last ViT layer. This has led authors to test the unsupervised foreground detection capabilities of such representations (Wang et al. 2022; Amir et al. 2021), finding clever ways to cluster these feature representations to split foreground and background regions. Melas-Kyriazi et al.
(2022) went a step further and extended this to more than one foreground object. However, these applications are limited to simple images with a clear background and very few salient objects.
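The two mechanisms discussed above, treating an image as a sequence of p × p patches and reading out the class token's attention over those patches, can be illustrated with a minimal NumPy sketch. This is a toy single-head attention with random weights, not the actual ViT or DINO implementation; all function names and dimensions here are illustrative assumptions.

```python
import numpy as np

def patchify(img, p):
    """Split an HxWxC image into a sequence of flattened p x p patches."""
    H, W, C = img.shape
    grid = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * C)  # (N, p*p*C) with N = (H/p)*(W/p)

def cls_attention_map(tokens, Wq, Wk):
    """Toy single-head self-attention: return the [CLS] token's attention
    over the patch tokens, reshaped into a 2D map over the patch grid."""
    q, k = tokens @ Wq, tokens @ Wk
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)          # row-wise softmax
    n = int(np.sqrt(tokens.shape[0] - 1))
    return attn[0, 1:].reshape(n, n)             # drop CLS->CLS entry

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
seq = patchify(img, 8)                 # 16 patches, each of dim 8*8*3 = 192
d = seq.shape[1]
tokens = np.concatenate([rng.random((1, d)), seq])  # prepend a [CLS] token
Wq, Wk = rng.random((d, d)), rng.random((d, d))
amap = cls_attention_map(tokens, Wq, Wk)
print(amap.shape)  # (4, 4): attention over the 4x4 patch grid
```

In DINO-style visualizations, this per-patch attention map (from a trained model rather than random weights) is upsampled back to the image resolution, which is what reveals the salient regions mentioned above.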

