OPEN-VOCABULARY PANOPTIC/UNIVERSAL SEGMENTATION WITH MASKCLIP

Abstract

In this paper, we tackle an emerging computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) at inference time for arbitrary categories given as text descriptions. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features within the MaskCLIP Visual Encoder, avoiding a time-consuming student-teacher training process. We obtain encouraging results for open-vocabulary panoptic/instance segmentation and state-of-the-art results for semantic segmentation on the ADE20K and PASCAL datasets. We also show qualitative results for MaskCLIP with custom categories specified online.

1. INTRODUCTION

Panoptic segmentation (Kirillov et al., 2019b) or image parsing (Tu et al., 2005) integrates the task of semantic segmentation (Tu, 2008) for background regions (e.g. "stuff" like "road", "sky") and instance segmentation (He et al., 2017) for foreground objects (e.g. "things" such as "person", "table"). Existing panoptic segmentation methods (Kirillov et al., 2019b;a; Li et al., 2019; Xiong et al., 2019; Lazarow et al., 2020) and instance segmentation approaches (He et al., 2017) deal with a fixed set of category definitions, which are essentially represented by categorical labels without semantic relations. DEtection TRansformer (DETR) (Carion et al., 2020) is a pioneering work that builds a Transformer-based architecture for both object detection and panoptic segmentation. The deep learning field is moving rapidly towards open-world/zero-shot settings (Bendale & Boult, 2015) where computer vision tasks such as classification (Radford et al., 2021), object detection (Li et al., 2022b; Zareian et al., 2021; Zang et al., 2022; Gu et al., 2022; Cai et al., 2022), semantic labeling (Li et al., 2022a; Ghiasi et al., 2022), and image retrieval (Bendale & Boult, 2015; Hinami & Satoh, 2018; Zareian et al., 2021; Kamath et al., 2021) perform recognition and detection for categories beyond those in the training set. In this paper, we take advantage of pre-trained CLIP image and text embedding models (Radford et al., 2021), which map images and text into a shared embedding space. We first build a baseline method for open-vocabulary panoptic segmentation using CLIP models without training. We then develop a new algorithm, MaskCLIP, a Transformer-based approach that efficiently and effectively utilizes pre-trained dense/local CLIP features without heavy re-training. The key component of MaskCLIP is a Relative Mask Attention (RMA) module that seamlessly integrates the mask tokens with a pretrained ViT-based CLIP backbone.
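The training-free baseline described above can be sketched as follows: given class-agnostic masks and dense image features from a pre-trained CLIP model, each mask is labeled by pooling the features inside its region and matching the pooled vector against CLIP text embeddings via cosine similarity. This is a minimal illustrative sketch, not the paper's exact pipeline; the function name `classify_masks` and the use of NumPy arrays in place of real CLIP encoders are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length (with a small epsilon for stability)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def classify_masks(dense_feats, masks, text_embeds):
    """Assign each binary mask an open-vocabulary label.

    dense_feats: (H, W, D) per-pixel features from a CLIP-like image encoder.
    masks:       list of (H, W) boolean class-agnostic masks.
    text_embeds: (C, D) text embeddings, one per candidate category name.
    Returns a list with the index of the best-matching category per mask.
    """
    labels = []
    for m in masks:
        # Mask-pool: average the dense features over the mask's support.
        pooled = dense_feats[m].mean(axis=0)
        # Cosine similarity between the pooled region and each text prompt.
        sims = l2_normalize(pooled) @ l2_normalize(text_embeds).T
        labels.append(int(np.argmax(sims)))
    return labels
```

Because image and text embeddings live in the same space, the category list can be swapped at inference time with no retraining, which is what makes the vocabulary "open".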
MaskCLIP is distinct and advantageous compared with existing approaches in three aspects: 1) a canonical background and instance segmentation representation via mask tokens, with a unique encoder-only strategy that tightly couples a pre-trained CLIP image feature encoder with the mask token encoder; 2) MaskCLIP avoids the challenging student-teacher distillation processes of e.g. OVR-CNN (Zareian et al., 2021) and ViLD (Gu et al., 2022), which are limited by the number of teacher-labeled objects available for training; 3) MaskCLIP also learns to refine masks beyond the simple pooling used in e.g. OpenSeg (Ghiasi et al., 2022). The contributions of our work are listed as follows.
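The mask-token idea behind the encoder-only strategy can be illustrated with a masked cross-attention step: each mask token reads out features from a frozen backbone's patch tokens, but attends only to patches inside its associated region. This is a hedged sketch of the general mechanism, not the paper's Relative Mask Attention module; the function `masked_cross_attention` and its single-head, NumPy formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(mask_tokens, patch_feats, region_masks):
    """One step of region-restricted attention (illustrative sketch).

    mask_tokens:  (Q, D) learnable query tokens, one per candidate mask.
    patch_feats:  (P, D) patch features from a frozen pre-trained encoder.
    region_masks: (Q, P) boolean; True where a token may attend to a patch.
    Returns (Q, D) features pooled from each token's region.
    """
    d = mask_tokens.shape[1]
    scores = mask_tokens @ patch_feats.T / np.sqrt(d)
    # Block attention outside each token's region with a large negative bias.
    scores = np.where(region_masks, scores, -1e9)
    attn = softmax(scores, axis=-1)
    return attn @ patch_feats
```

Restricting attention this way lets mask tokens aggregate local CLIP features per region while the backbone stays frozen, so no distillation pass over a teacher model is needed.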

