TRANSFORMER-BASED OPEN-WORLD INSTANCE SEGMENTATION WITH CROSS-TASK CONSISTENCY REGULARIZATION

Abstract

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. The mainstream approaches use a two-stage segmentation framework, which first locates candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that the end-to-end training process in the single-stage framework can be more convenient for directly regularizing the localization of class-agnostic object pixels. Based on the transformer-based instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss can alleviate the problem of incomplete instance annotation, a common problem in existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution to semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the AP100 score by 4.75% in the UVO→UVO setting and 4.05% in the COCO→UVO setting. In the semi-supervised setting, our model trained with only 30% labeled data even outperforms its fully-supervised counterpart trained with 50% labeled data. The code will be released soon.

1. INTRODUCTION

Traditional instance segmentation methods Lin et al. (2014); Cordts et al. (2016) often assume that objects in images can be categorized into a finite set of predefined classes (i.e., closed-world). Such an assumption, however, can be easily violated in many real-world applications, where models will encounter many new object classes that never appeared in the training data. Note that our work is not just a straightforward adaptation of Mask2Former from closed-world to open-world. Unlike closed-world segmentation, where the object categories can be clearly defined before annotation, the open-world scenario makes it challenging for annotators to label all instances completely or to ensure annotation consistency across different images, because no well-defined finite set of object categories is available. As shown in Figure 1 (a), annotators miss some instances. How to handle such incomplete annotations (i.e., missed instances) remains an open challenge.



Therefore, researchers have recently attempted to tackle the problem of Open-World Instance Segmentation (OWIS) Wang et al. (2021), which targets class-agnostic segmentation of all objects in the image. Prior to this paper, most existing methods for OWIS were two-stage Wang et al. (2022); Saito et al. (2021): they first detect bounding boxes of objects and then segment them. Despite their promising performance, such a paradigm cannot recover objects whose bounding boxes are missed by the detector. In contrast, a transformer-based approach called Mask2Former Cheng et al. (2022) has recently been introduced, yet only for closed-world instance segmentation. Building on Mask2Former, we propose a Transformer-based Open-world Instance Segmentation method named TOIS.
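The exact formulation of the cross-task consistency loss is given later in the paper; as a rough illustration of the idea, the following sketch assumes the loss compares the pixel-wise union of the predicted per-instance soft masks against the foreground head's prediction with an L1 penalty (the function name and the union-by-max choice are illustrative assumptions, not the paper's definition):

```python
def consistency_loss(instance_masks, foreground_map):
    """Hypothetical cross-task consistency penalty.

    instance_masks: list of HxW soft masks (nested lists of floats in [0, 1]),
                    one per predicted instance.
    foreground_map: HxW soft foreground prediction from the regularization head.

    The pixel-wise union (max over instances) of the instance masks should
    agree with the foreground map; their mean absolute difference is returned,
    so each task regularizes the other during end-to-end training.
    """
    h = len(foreground_map)
    w = len(foreground_map[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            union = max(mask[y][x] for mask in instance_masks)
            total += abs(union - foreground_map[y][x])
    return total / (h * w)
```

In a real single-stage framework this would operate on batched tensors with a differentiable soft-max-style union, but the scalar version above captures the relation the loss exploits: pixels covered by some instance mask and pixels marked as foreground should coincide.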

