TRANSFORMER-BASED OPEN-WORLD INSTANCE SEGMENTATION WITH CROSS-TASK CONSISTENCY REGULARIZATION

Abstract

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. Mainstream approaches use a two-stage segmentation framework, which first locates candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that the end-to-end training process of the single-stage framework is more convenient for directly regularizing the localization of class-agnostic object pixels. On top of a transformer-based instance segmentation framework, we propose a regularization module that predicts foreground pixels, and we use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss alleviates the problem of incomplete instance annotation, a common problem in existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution to semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the AP100 score by 4.75% in the UVO→UVO setting and 4.05% in the COCO→UVO setting. In the semi-supervised setting, our model trained with only 30% of the labeled data even outperforms its fully-supervised counterpart trained with 50% of the labeled data. The code will be released soon.

1. INTRODUCTION

Traditional instance segmentation methods Lin et al. (2014); Cordts et al. (2016) often assume that objects in images can be categorized into a finite set of predefined classes (i.e., a closed world). Such an assumption, however, can easily be violated in many real-world applications, where models encounter new object classes that never appeared in the training data. Therefore, researchers have recently attempted to tackle the problem of Open-World Instance Segmentation (OWIS) Wang et al. (2021), which targets class-agnostic segmentation of all objects in an image. Most existing OWIS methods are two-stage Wang et al. (2022); Saito et al. (2021): they first detect bounding boxes of objects and then segment them. Despite their promising performance, such a paradigm cannot recover when an object bounding box is missed. In contrast, a transformer-based approach called Mask2Former Cheng et al. (2022) has recently been introduced, but only for closed-world instance segmentation. Building on Mask2Former, we propose a Transformer-based Open-world Instance Segmentation method named TOIS.

Note that our work is not just a straightforward adaptation of Mask2Former from the closed world to the open world. Unlike closed-world segmentation, where object categories can be clearly defined before annotation, the open-world scenario makes it challenging for annotators to label all instances completely, or to ensure annotation consistency across images, because there is no well-defined finite set of object categories. As shown in Figure 1(a), annotators miss some instances, and how to handle such incomplete annotations remains an open challenge. Our proposed TOIS method is end-to-end and simpler than two-stage alternatives, and it addresses the incomplete-annotation issue via a novel regularization module that is simple yet effective. Specifically, the network concurrently predicts not only (1) instance masks but also (2) a foreground map. Ideally, as shown in Figure 1(b), the foreground region should be consistent with the union of all instance masks. To penalize their inconsistency, we devise a cross-task consistency loss, which down-weights the adverse effects caused by incomplete annotation: when an instance is missing from the annotation but is captured by both the predicted instance masks and the predicted foreground map, the consistency loss remains low and hence encourages such a prediction. Experiments in Figure 1(g) show that this consistency loss is effective even when annotations miss many instances.

Moreover, as shown in Figure 1(c), novel objects that are unannotated in the training set are segmented successfully by our method. Like most existing methods, the discussion so far concerns fully-supervised OWIS. In this paper, we further extend OWIS to the semi-supervised setting, where some training images have no annotations at all. This setting is of great interest because annotating segmentation maps is very costly. Notably, our proposed regularization module also benefits semi-supervised OWIS: an unlabeled image can be considered an extreme case of incomplete annotation in which all instance annotations are missing. Specifically, we perform semi-supervised OWIS by first warming up the network on the labeled set and then continuing to train it with the cross-task consistency loss on a mixture of labeled and unlabeled images.

Contributions. In a nutshell, our main contributions can be summarized as follows:
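The core cross-task consistency idea, that the predicted foreground map should agree with the pixelwise union of all predicted instance masks, can be illustrated with a minimal sketch. The snippet below is our own illustrative NumPy version, not the paper's actual implementation: the function name, the use of a per-pixel maximum as the union operator, and the symmetric cross-entropy penalty are all assumptions made for clarity.

```python
import numpy as np

def cross_task_consistency_loss(instance_masks, foreground_map, eps=1e-6):
    """Penalize disagreement between a predicted foreground map and the
    union of predicted instance masks.

    instance_masks: (N, H, W) array of per-instance soft masks in [0, 1].
    foreground_map: (H, W) array of foreground probabilities in [0, 1].
    Returns a scalar: the mean per-pixel symmetric cross-entropy between
    the two soft maps (an illustrative choice of distance).
    """
    # Union of instance masks: per-pixel maximum over all instances.
    union = instance_masks.max(axis=0)
    p = np.clip(union, eps, 1 - eps)
    q = np.clip(foreground_map, eps, 1 - eps)
    # Symmetric binary cross-entropy between the two predictions.
    loss = -(q * np.log(p) + (1 - q) * np.log(1 - p)
             + p * np.log(q) + (1 - p) * np.log(1 - q)) / 2.0
    return float(loss.mean())

# Toy example: two instances tiling a 4x4 image. A foreground map that
# matches the union of the masks incurs a lower loss than one that
# contradicts it, which is the behavior the regularization relies on.
masks = np.zeros((2, 4, 4))
masks[0, :2] = 0.9   # instance 1 occupies the top half
masks[1, 2:] = 0.9   # instance 2 occupies the bottom half
consistent = masks.max(axis=0)
inconsistent = 1.0 - consistent
assert cross_task_consistency_loss(masks, consistent) < \
       cross_task_consistency_loss(masks, inconsistent)
```

Because the loss compares two predictions rather than a prediction against the annotation, an instance that annotators missed does not get penalized as long as both heads agree on it, which is how the down-weighting described above comes about.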

Figure 1: (a) Instances with missing annotations in the COCO and UVO datasets. The regions in red boxes are mistakenly annotated as background. (b) Motivation of our regularization module: the consistency relationship between the instance masks and the foreground map. (c) Visualization results of our TOIS on the UVO dataset; here, TOIS is trained on the COCO dataset and tested on the UVO dataset. Our method correctly segments many objects that are not labeled in COCO. (d-f) The AP100 (%) of our TOIS vs. SOTA methods on COCO→UVO, Cityscapes→Mapillary, COCO→UVO. (g) The AR100 (%) of our TOIS vs. the Mask2Former baseline on COCO. From right to left, as the total number of annotated classes decreases (i.e., more instance annotations are missed), the gain of our TOIS over the baseline grows, thanks to the capability of our model to handle incomplete annotations.
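The two-phase semi-supervised recipe described in the introduction (warm up on the labeled set, then continue on a mixture of labeled and unlabeled images, where unlabeled images contribute only the consistency term) can be sketched as the following training skeleton. Every name here (`supervised_loss`, `consistency_loss`, `step`) is a placeholder for illustration, not the paper's actual code.

```python
def train_semi_supervised(model, labeled, unlabeled, supervised_loss,
                          consistency_loss, step,
                          warmup_epochs=1, total_epochs=3):
    """Two-phase schedule: warm up on labeled images only, then mix in
    unlabeled images via the cross-task consistency loss."""
    for epoch in range(total_epochs):
        # Labeled images contribute the full objective in every epoch.
        for image, annotation in labeled:
            loss = supervised_loss(model, image, annotation) \
                   + consistency_loss(model, image)
            step(model, loss)
        if epoch >= warmup_epochs:
            # After warm-up, unlabeled images are treated as the extreme
            # case of incomplete annotation (all instances missing), so
            # only the consistency term applies.
            for image in unlabeled:
                step(model, consistency_loss(model, image))
    return model

# Tiny dry run with stub losses, to show when each data source is used:
calls = []
train_semi_supervised(model=None,
                      labeled=[("img1", "ann1")],
                      unlabeled=["img2"],
                      supervised_loss=lambda m, x, a: 1.0,
                      consistency_loss=lambda m, x: 0.5,
                      step=lambda m, loss: calls.append(loss),
                      warmup_epochs=1, total_epochs=3)
# The labeled image is seen in all 3 epochs (loss 1.5); the unlabeled
# image only in the 2 post-warm-up epochs (loss 0.5).
assert calls.count(1.5) == 3 and calls.count(0.5) == 2
```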

