OPEN-VOCABULARY PANOPTIC/UNIVERSAL SEGMENTATION WITH MASKCLIP

Abstract

In this paper, we tackle an emerging computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions at inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features within the MaskCLIP Visual Encoder, avoiding the time-consuming student-teacher training process. We obtain encouraging results for open-vocabulary panoptic/instance segmentation and state-of-the-art results for semantic segmentation on the ADE20K and PASCAL datasets. We show qualitative illustrations of MaskCLIP with online custom categories.

1. INTRODUCTION

Panoptic segmentation (Kirillov et al., 2019b) or image parsing (Tu et al., 2005) integrates the task of semantic segmentation (Tu, 2008) for background regions (e.g. "stuff" like "road", "sky") and instance segmentation (He et al., 2017) for foreground objects (e.g. "things" such as "person", "table"). Existing panoptic segmentation methods (Kirillov et al., 2019b;a; Li et al., 2019; Xiong et al., 2019; Lazarow et al., 2020) and instance segmentation approaches (He et al., 2017) deal with a fixed set of category definitions, which are essentially represented by categorical labels without semantic relations. DEtection TRansformer (DETR) (Carion et al., 2020) is a pioneering work that builds a Transformer-based architecture for both object detection and panoptic segmentation. The deep learning field is moving rapidly towards open-world/zero-shot settings (Bendale & Boult, 2015) where computer vision tasks such as classification (Radford et al., 2021), object detection (Li et al., 2022b; Zareian et al., 2021; Zang et al., 2022; Gu et al., 2022; Cai et al., 2022), semantic labeling (Li et al., 2022a; Ghiasi et al., 2022), and image retrieval (Bendale & Boult, 2015; Hinami & Satoh, 2018; Zareian et al., 2021; Kamath et al., 2021) perform recognition and detection for categories beyond those in the training set. In this paper, we take advantage of pre-trained CLIP image and text embedding models (Radford et al., 2021), which are mapped to the same space. We first build a baseline method for open-vocabulary panoptic segmentation using CLIP models without training. We then develop a new algorithm, MaskCLIP, a Transformer-based approach that efficiently and effectively utilizes pre-trained dense/local CLIP features without heavy re-training. The key component of MaskCLIP is a Relative Mask Attention (RMA) module that seamlessly integrates the mask tokens with a pretrained ViT-based CLIP backbone.
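Concretely, panoptic segmentation asks for a single per-pixel output in which "stuff" pixels carry a semantic label and "thing" pixels additionally carry an instance id. The following is a minimal sketch of how the two sub-task outputs combine (our own simplification for illustration; the function name and the back-to-front overwrite rule for occlusions are assumptions, not the exact procedure of any cited method):

```python
import numpy as np

def merge_panoptic(semantic, instances):
    """Combine stuff labeling and thing instances into one panoptic map.

    semantic:  (H, W) int array of "stuff" semantic labels.
    instances: list of (mask, category_id) pairs for "thing" objects,
               assumed ordered back-to-front; later instances overwrite
               earlier ones, mimicking a simple occlusion-resolution rule.
    Returns (H, W, 2): per-pixel (category_id, instance_id);
    instance_id 0 marks stuff regions, things are numbered from 1.
    """
    panoptic = np.stack([semantic, np.zeros_like(semantic)], axis=-1)
    for inst_id, (mask, cat) in enumerate(instances, start=1):
        panoptic[mask, 0] = cat      # thing category overwrites stuff label
        panoptic[mask, 1] = inst_id  # per-object instance id
    return panoptic
```

This also makes clear why a fixed label set is limiting: both `semantic` and the per-instance `category_id` are bare integers with no semantic relations between them, which is exactly what the open-vocabulary setting replaces with text embeddings.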
MaskCLIP is distinct from and advantageous over existing approaches in three aspects: 1) a canonical background and instance segmentation representation via mask tokens, with a unique encoder-only strategy that tightly couples a pre-trained CLIP image feature encoder with the mask token encoder; 2) MaskCLIP avoids the challenging student-teacher distillation processes of e.g. OVR-CNN (Zareian et al., 2021) and ViLD (Gu et al., 2022), which face a limited number of teacher objects for training; 3) MaskCLIP also learns to refine masks beyond the simple feature pooling used in e.g. OpenSeg (Ghiasi et al., 2022). The contributions of our work are listed as follows.
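The training-free baseline rests on CLIP's image and text embeddings living in a shared space. As a hedged illustration of this idea (the function names, mask-average pooling, and plain cosine-similarity scoring are our own simplification, not the exact baseline pipeline), one can score a mask-pooled dense image feature against text embeddings of arbitrary category prompts:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project features onto the unit sphere, as CLIP does before matching.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify_mask(dense_features, mask, text_embeddings, temperature=100.0):
    """Assign an open-vocabulary class to one segmentation mask.

    dense_features:  (H, W, D) per-pixel (dense/local) CLIP image features.
    mask:            (H, W) boolean foreground mask.
    text_embeddings: (C, D) CLIP text embeddings of the category prompts.
    Returns (class_index, softmax_scores).
    """
    pooled = dense_features[mask].mean(axis=0)   # mask-average pooling
    pooled = l2_normalize(pooled)
    text = l2_normalize(text_embeddings)
    logits = temperature * (text @ pooled)       # scaled cosine similarities
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()
    return int(scores.argmax()), scores
```

Because the categories enter only through `text_embeddings`, new categories can be specified freely at inference by encoding new text prompts, with no retraining of the image branch.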

• We develop a new algorithm, MaskCLIP, to perform open-vocabulary panoptic segmentation, building on top of a canonical background and instance mask representation with a cascaded mask proposal and refinement process.
• We devise the MaskCLIP Visual Encoder under an encoder-only strategy, tightly coupling a pre-trained CLIP image feature encoder with the mask token encoder to allow for the direct formulation of the mask feature representation for semantic/instance segmentation + refinement and class prediction. Within the MaskCLIP Visual Encoder, a new module called Relative Mask Attention (RMA) performs mask refinement.
• MaskCLIP expands the scope of existing CLIP models to open-vocabulary panoptic segmentation by demonstrating encouraging and competitive results for open-vocabulary panoptic, instance, and semantic segmentation.
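To make the encoder-only coupling concrete, the sketch below shows one simplified attention step in which learnable mask tokens attend to frozen CLIP patch tokens, with the attention logits biased by each token's current soft mask proposal so that each mask token pools features from its own region. This is an assumption-laden illustration of the Relative-Mask-Attention idea (the log-mask additive bias, shapes, and function name are ours), not the exact RMA module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_token_attention(patch_tokens, mask_tokens, mask_proposals, eps=1e-6):
    """One simplified attention step coupling mask tokens with CLIP patch tokens.

    patch_tokens:   (P, D) pre-trained CLIP patch features (kept frozen).
    mask_tokens:    (M, D) learnable mask-token queries.
    mask_proposals: (M, P) soft masks in [0, 1], one per mask token.
    Each mask token attends to all patch tokens, but its logits are biased by
    the log of its current mask proposal, so attention concentrates inside the
    proposed region. Returns (M, D) region features for the mask tokens.
    """
    scale = 1.0 / np.sqrt(patch_tokens.shape[-1])
    logits = scale * (mask_tokens @ patch_tokens.T)   # (M, P) content scores
    logits = logits + np.log(mask_proposals + eps)    # relative mask bias
    attn = softmax(logits, axis=-1)
    return attn @ patch_tokens                        # aggregate region features
```

Because the bias is soft rather than a hard crop, gradients can flow to the mask proposals, which is what lets the module refine masks rather than merely pool under them.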

Table 1: Comparison of recent open-vocabulary approaches for object detection, semantic segmentation, and instance segmentation, e.g. (Rao et al., 2022) and XPM (Huynh et al., 2022). ✓✗ indicates that the corresponding method only loosely follows the definition. "Dense CLIP features" refers to the use of pixel-wise/local features. Note that OpenSeg uses ALIGN (Jia et al., 2021), which is an alternative to CLIP.

2. RELATED WORK

Open vocabulary. The open-vocabulary setting has been gaining popularity lately, since the traditional fully supervised setting cannot handle classes unseen during testing, while real-world vision applications such as scene understanding, self-driving, and robotics are commonly required to predict unseen classes. Previous open-vocabulary attempts have primarily targeted object detection. ViLD (Gu et al., 2022) trains a student model to distill the knowledge of CLIP. RegionCLIP (Zhong et al., 2022) finetunes the pretrained CLIP model to match image regions with the corresponding texts. OV-DETR (Zang et al., 2022) uses CLIP as an external model to obtain the query embeddings. Recently, there has also been work on open-vocabulary semantic segmentation (Ghiasi et al., 2022).

Existing panoptic segmentation methods (Li et al., 2019; Xiong et al., 2019; Lazarow et al., 2020) perform training and testing based on a fixed set of category labels. Open-set panoptic segmentation (Hwang et al., 2021) is an exemplar-based approach that requires categories to be known in advance, which is narrower than the open-vocabulary setting where categories of interest can be freely specified at inference.

Open-vocabulary panoptic segmentation: an emerging task. As open-set, open-world, zero-shot, and open-vocabulary are relatively new concepts with no commonly accepted definitions, different algorithms are often not directly comparable, differing in problem definition/setting, training data, and testing scope. Table 1 gives a summary of the recent open-vocabulary applications.
XPM (Huynh et al., 2022) utilizes vision-language cross-modal data to generate pseudo-mask supervision to train a student model for instance segmentation, and thus may not be fully open-vocabulary in allowing arbitrary object specifications at inference time. LSeg (Li et al., 2022a) is also limited in its open-vocabulary capability, as the learned CNN image features in LSeg are not exposed to representations beyond the training label categories. OpenSeg (Ghiasi et al., 2022) is potentially applicable to instance/panoptic segmentation, but OpenSeg is formulated to be trained on captions, which lack the instance-level information that is fundamental to panoptic segmentation. The direct image feature pooling strategy in OpenSeg is potentially another limiting factor for open-vocabulary panoptic segmentation. Nevertheless, no results for open-vocabulary panoptic/instance segmentation are reported in (Ghiasi et al., 2022).

