OPEN-VOCABULARY PANOPTIC/UNIVERSAL SEGMEN-TATION WITH MASKCLIP

Abstract

In this paper, we tackle an emerging computer vision task, open-vocabulary panoptic segmentation, that aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions in inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning nor distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, which is an encoder-only module that seamless integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features within the MaskCLIP Visual Encoder that avoids the time-consuming student-teacher training process. We obtain encouraging results for open-vocabulary panoptic/instance segmentation and state-of-the-art results for semantic segmentation on ADE20K and PASCAL datasets. We show qualitative illustration for MaskCLIP with online custom categories.

1. INTRODUCTION

Panoptic segmentation (Kirillov et al., 2019b) or image parsing (Tu et al., 2005) integrates the task of semantic segmentation (Tu, 2008) for background regions (e.g. "stuff" like "road", "sky") and instance segmentation (He et al., 2017) for foreground objects (e.g. "things" such as "person", "table "). Existing panoptic segmentation methods (Kirillov et al., 2019b; a; Li et al., 2019; Xiong et al., 2019; Lazarow et al., 2020) and instance segmentation approaches (He et al., 2017) deal with a fixed set of category definitions, which are essentially represented by categorical labels without semantic relations. DEtection TRansformer (DETR) (Carion et al., 2020) is a pioneering work that builds a Transformer-based architecture for both object detection and panoptic segmentation. The deep learning field is moving rapidly towards the open-world/zero-shot settings (Bendale & Boult, 2015) where computer vision tasks such as classification (Radford et al., 2021) , object detection (Li et al., 2022b; Zareian et al., 2021; Zang et al., 2022; Gu et al., 2022; Cai et al., 2022) , semantic labeling (Li et al., 2022a; Ghiasi et al., 2022) , and image retrieval (Bendale & Boult, 2015; Hinami & Satoh, 2018; Zareian et al., 2021; Hinami & Satoh, 2018; Kamath et al., 2021) perform recognition and detection for categories beyond those in the training set. In this paper, we take the advantage of the existence of pre-trained CLIP image and text embedding models (Radford et al., 2021) , that are mapped to the same space. We first build a baseline method for open-vocabulary panoptic segmentation using CLIP models without training. We then develop a new algorithm, MaskCLIP, that is a Transformer-based approach efficiently and effectively utilizing pre-trained dense/local CLIP features without heavy re-training. The key component of MaskCLIP is a Relative Mask Attention (RMA) module that seamlessly integrates the mask tokens with a pretrained ViT-based CLIP backbone. MaskCLIP is distinct and advantageous compared with existing approaches in three aspects: 1) A canonical background and instance segmentation representation by the mask token representation with a unique encoder-only strategy that tightly couples a pre-trained CLIP image feature encoder with the mask token encoder. 2) MaskCLIP avoids the challenging student-teacher distillation processes such as OVR-CNN (Zareian et al., 2021) and ViLD (Gu et al., 2022) that face limited number of teacher objects to train; 3) MaskCLIP also learns to refine masks beyond simple pooling in e.g. OpenSeg (Ghiasi et al., 2022) . The contributions of our work are listed as follows. • We develop a new algorithm, MaskCLIP, to perform open-vocabulary panoptic segmentation building on top of canonical background and instance mask representation with a cascade mask proposal and refinement process. • We device the MaskCLIP Visual Encoder under an encoder-only strategy by tightly coupling a pre-trained CLIP image feature encoder with the mask token encoder, to allow for the direct formulation of the mask feature representation for semantic/instance segmentation+refinement, and class prediction. Within the MaskCLIP Visual Encoder, there is a new module called Relative Mask Attention (RMA) that performs mask refinement. • MaskCLIP expands the scope of the existing CLIP models to open-vocabulary panoptic segmentation by demonstrating encouraging and competitive results for open-vocabulary-panoptic, instance, and semantic segmentation. instance segmentation, and panoptic segmentation. GLIP (Li et al., 2022b) ; OVR-CNN (Zareian et al., 2021) ; ViLD (Gu et al., 2022) ; RegionCLIP (Zhong et al., 2022) ; OV-DETR (Zang et al., 2022) ; LSeg (Li et al., 2022a) ; OPenSeg (Ghiasi et al., 2022) ; DenseCLIP (Rao et al., 2022) ; XPM (Huynh et al., 2022) . ✓ ✗ indicates that the corresponding method is loosely following the definition. Dense Clip features refer to the use of pixel-wise/local features. Note that OpenSeg uses its ALIGN (Jia et al., 2021) , which is an alternative to CLIP. 

Open vocabulary.

The open vocabulary setting is gaining increasing popularity lately as traditional fully supervised setting cannot handle unseen classes during testing, while real world vision applications like scene understanding, self driving and robotics are commonly required to predict unseen classes. Previous open-vocabulary attempts have been primarily made for object detection. ViLD (Gu et al., 2022) trains a student model to distill the knowledge of CLIP. RegionCLIP (Zhong et al., 2022) finetunes the pretrained CLIP model to match the image areas with corresponding texts. OV-DETR (Zang et al., 2022) uses CLIP as an external model to obtain the query embedding from CLIP model. Recently there is also work made for open-vocabulary semantic segmentation (Ghiasi et al., 2022) . Panoptic segmentation. Existing panoptic segmentation (or image parsing (Tu et al., 2005) ) methods (Kirillov et al., 2019b; a; Li et al., 2019; Xiong et al., 2019; Lazarow et al., 2020) perform training and testing based on a fixed set of category labels. Open-set panoptic segmentation (Hwang et al., 2021) is an exemplar based approach that requires categories to be known in advance, which is narrower than the open-vocabulary setting where categories of interest can be freely specified in inference. Open (Ghiasi et al., 2022) . CLIP model distillation/reuse. After its initial release, the CLIP model (Radford et al., 2021) that is learned from large-scale image-text paired captioning datasets has received a tremendous amount of attention. Some other similar vision-language models have also been proposed later e.g. ALIGN (Jia et al., 2021) , GLIP (Li et al., 2022b) . Many algorithms have been developed lately (Zang et al., 2022; Wang et al., 2022; Zhong et al., 2022; Luo et al., 2021; Patashnik et al., 2021; Shen et al., 2022) trying knowledge distillation from the CLIP model to benefit the down-stream tasks one way or the other by leveraging the rich semantic language information paired in the images. Here, we directly adopt the backbone of CLIP image model to train for open-vocabulary panoptic segmentation. There have been attempts (Rao et al., 2022; Zhou et al., 2022) that use the dense CLIP features to represent pixel-wise feature as teacher model to train student model for semantic segmentation. Our pipeline, shown in Figure 1 , contains two stages. The first stage is a class-agnostic mask proposal network. The second stage is MaskCLIP Visual Encoder which is built on the CLIP (Radford et al., 2021) ViT architecture. It takes the images and the coarse masks from the first stage as the input and will output refined masks along with the corresponding dense image features for further classification using the text embeddings from the CLIP text encoder.

3.1. CLASS-AGNOSTIC MASK PROPOSAL NETWORK

Our Class-Agnostic Mask Proposal Network is built on instance/segmentation models such as MaskRCNN (He et al., 2017) and Mask2Former (Cheng et al., 2022) . To make the model classagnostic, we remove the class supervision during training. The classification head thus becomes a binary classification for either positive or negative in these models.

3.2. MASKCLIP VISUAL ENCODER

Similar to CLIP, our MaskCLIP Visual Encoder also predicts the image features. Unlike the CLIP Visual Encoder that only uses on class token to output the feature of the whole image. Our MaskCLIP Visual Encoder uses another M Mask Class Tokens to output the partial/dense features for each corresponding area of the image given the masks. The Mask Class Tokens use attention masks and Relative Mask Attention to obtain the partial/dense features which we discuss in the following two parts.

3.2.1. MASK CLASS TOKENS

In order to obtain dense image features for the corresponding masks or bounding boxes for further recognition or distillation, an easy way to do this is simply masking or cropping the image and then sending the obtained image to the pretrained image encoder. This method has been widely used in several open vocabulary object detection methods (Zhong et al., 2022; Gu et al., 2022) . The problem is that it's not computation efficient (N masks/boxes will lead to N images and they will be computed through the image encoder independently) and also loses the ability to see the global image context information which is very important for recognizing some objects and stuff. For masking, another problem is that masks are in different shapes and simply masking the image will cause the resulted image to have transparent background which usually doesn't exist in real images that are used for training in large language-vision models e.g. CLIP. To solve this, we propose Mask Class Tokens for efficient feature extraction from images without losing the global image context information. In the original CLIP ViT-based visual encoder framework, the input of the network is N image tokens and 1 class token. The final output of the class token will be used for the relation computation with the text embeddings. Our newly introduced M Mask Class Tokens will be alongside with the image tokens and the class token. The embeddings' weights of the Mask Class Token are provided by the class token in the pretrained CLIP ViT model and are fixed. Each Mask Class Token will output a corresponding dense image feature similar to the class token which outputs the feature of the whole image. To achieve this, we design an attention mask as following M = F (N +1)×(N +1) T (N +1)×M M ′ M ×N F M ×1 T M ×M (1) in which M is the number of Mask Class Tokens, N is the number of image tokens, T m×n is an m × n True matrix, F m×n is an m × n False matrix and M ′ is defined as following: M ′ i,j = False if mask i contains at least one pixel in patch j True otherwise. ( ) where True means that this position is masked out i.e. not allowed to attend and False otherwise. In this way, each Mask Class Token will learn from the corresponding mask area of the images. The image tokens are also interacting with each other which means the global information won't lose. And it's also very efficient since we don't need to do redundant computing for each mask or finetune the pretrained model. However, the mask information are not fully utilized and they cannot be refined either. But we will see in the experiments later that simply adpoting Mask Class Tokens to the pretrained CLIP model without any finetuning will already serve as a competitive baseline. A ′ :M,-N : to the original attention matrix. The newly added attention matrix is computed using the Image Tokens and the Mask Patch Tokens. The mask patch tokens are acquired by patchifying the masks using the similar way for the images as showed here. Moreover, the masks are refined by using Mr in Eq. 5 which is computed by Image Tokens and Mask Class Tokens.

3.2.2. RELATIVE MASK ATTENTION

To further utilize the mask information and refine the coarse masks, we propose Relative Mask Attention mechanism in our transformer. Our key design principle is try not to change the CLIP features directly as this would destroy the learned relationship between the image features and text features in the CLIP model. Therefore, we adopt a way to only change the attention matrix in the transformer to learn a better linear combination of the values in the attention layers according to the mask information. As in Figure 2 , our proposed Relative Mask Attention Mechanism only changes the attention matrix and refines the masks. M r is defined in Eq. 5. A ′ :M,-N : is defined in Eq. 3. f M is the class-agnostic mask proposal network. f 1 and f 2 are two downsampling networks that encode the images/masks to image tokens/mask patch tokens sharing the same architecture. f r is a two-layer convolutional network that maps the attention matrix to a mask residual. Similar to relative positional encoding, we use a relative attention mechanism here. Let D be the dimension of the token embedding, for each Mask Class Token T MC i ∈ R D with a corresponding mask K i ∈ R H×W whose shape is the same as the image, we use a similar way as for the images to get mask patch tokens T MP ∈ R M ×N ×D in the computation of the attention. In our attention matrix, the Mask Class Tokens attending image tokens part will then be as following: A ′ :M,-N : = D c (ϕ Qm (T MP ) ⊙ ϕ Km (T IM )) c (3) A :M,-N : = ϕ Q (T MC ) • ϕ K (T IM ) + A ′ :M,-N : 2 √ D where T IM ∈ R N ×D is image tokens, T MC ∈ R M ×D is Mask Class Tokens, T MP ∈ R M ×N ×D is Mask Patch Tokens ϕ Q , ϕ K , ϕ Qm , ϕ Km are linear transformations, ⊙ is element-wise product and D c (•) c is the sum of the embedding dimension. ϕ Km (T IM ) ∈ R N ×D will first be broadcast to R M ×N ×D before doing element-wise production. The attention will also in turn be used for the refinement of the masks. The vanilla attention can be seen as a relationship between each mask area and all the image patches. Thus we utilize this to help our coarse masks be more accurate. The updating process of the masks is as following: M r = σ(σ -1 (M c ) + f r (ϕ Q (T MC ) ⊙ ϕ K (T IM ))) where M c , M r ∈ R N ×H×W denotes the coarse mask and refined mask respectively, f r is a learnable non-linear function that maps the attention matrix to a mask residual, σ and σ -1 are sigmoid and inverse sigmoid functions respectively.

4. EXPERIMENTS

In this part, we train our proposed MaskCLIP method using COCO (Lin et al., 2014) CLIP Baseline. We design our first baseline by directly using the class-agnostic mask proposal network from the first stage and the pretrained CLIP model. We mask the images according to the masks from the class-agnostic mask proposal network and send the masked images to the CLIP model to get classification results. The pretrained CLIP model we use is ViT-L/14@336px and the text inputs we use are simply the category names defined by each dataset. Those two settings keep the same with the following two methods for fair comparison. MaskCLIP. In our MaskCLIP method, we still use the CLIP ViT-L/14@336px pretrained model as with the previous two. This model has 24 attention layers and we add Relative Mask Attention in four of them which is 6, 12, 18, 24. We use AdamW (Loshchilov & Hutter, 2019) as our optimizer and the learning rate is set to 0.0001. We train our model on COCO training data for 10k iterations with a batchsize of 8. The training takes around 3h on 8 Nvidia A5000 GPUs. Loss Function. The loss function is In this part, we show two sets of images to demonstrate our model capability. The first is the qualitative results on ADE20K. We compare our method with the two baselines in Figure 3 . It can be seen that our method performs much better than the two baselines. The results from the first column show that due to the lack of global information, CLIP baseline failes to predict the floor. L = λ ce L ce + λ dice L dice + λ bce L bce , Instead it predicts skyscraper. This is an easy case but if only floor area is provided it does have some similarities with the wall of a skyscraper. (a) "toy rabbit", "real rabbit", (b) "horse", "donkey", (c) "empty bottle", "filled bottle", "background" "sky", "grass" "door", "wall", "ground" 

4.4. OPEN-VOCABULARY SEMANTIC SEGMENTATION/LABELING

We also use our method to compare with open-vocabulary semantic segmentation as in Table 3 . The setting is similar, they all train on COCO panoptic training set and test on ADE20K validation set. On the four datasets we test, MaskCLIP reaches the state-of-the-art results on three of them with only P-59 slightly lower. To compare with previous methods, we also provide a semantic segmentation comparison in Figure 5 . Results on ALIGN++ and OpenSeg are directly from (Ghiasi et al., 2022) and we run the same image using our MaskCLIP model. It can be seen that due to the open vocabulary setting, some similar classes may be mistakenly classified e.g. all three methods predict the house in this image while the ground truth is building. COCO Split Setting. Besides the cross-dataset setting, we also follow the COCO Split Setting in XPM (Huynh et al., 2022) to perform the instance segmentation in Table 5 . On the generalized setting which is a more challenging setting, we outperforms previous results in base, target and categories. On the constrained setting, we also show competitive results in both base and target categories. 

5. ABLATION STUDY

Incorporating GT Masks. Since our model can decouple the mask proposal process and the classification process, we could also use the ground truth mask proposals which can be regarded as a "perfect" mask proposal network in our method. In this way, we can eliminate the effects of quality of the mask proposals and inspect the method's classification capabilities. In Table 6 . We can see that the performance could gain a lot from the "perfect" mask proposals. And our MaskCLIP method also outperforms OpenSeg in this setting. Mask Refinement. In our Relative Mask Attention part, the attention layer will use the accurate mask information to learn a better attention matrix and the mask will also use the attention information to gradually refine itself. In this ablation study, we only let the attention matrix learn from the mask without any mask refinement. And we get the results in Table 7 . Since the SQ reflects the segemtation quality, we care more about SQ here. It can be seen that MaskCLIP performs slightly better than that without the mask refinement which demonstrates the effectivity of the mask refinement. For each mask: the cropped image region is the element-wise product between the binary mask M i and the image I, i.e. R i ∈ R H×W ×3 ; the visual embedding V i ∈ R D of the cropped region is computed by the visual encoder where D is the hidden dimension; the final classification score Y i ∈ R C is the softmax over the dot product between the visual embedding V i and the text embedding T . A formal algorithm is described as 1 and a visualization of this is shown as 6. 

B ABLATION ON USING RELATIVE MASK ATTENTION IN DIFFERENT LAYERS

In this part we conduct an ablation study on using different layers for relative mask attention. Since our pretrained CLIP model is fixed during the whole training procedure, whether each layer would help the final results remains a question. We use four different kinds of layers combination of the layers in this part and provide the results in Table 8 . We can see that the last layer is a key part of our results since the features are gradually learned through all the attention layer. Though the last four layers' features should the best, the performance wouldn't be better if Relative Mask Attention is only used in the last four layers. This is also reasonable since the network should not have the accurate mask information too late.

C MORE VISUALIZATION RESULTS ON ARBITRARY CATEGORIES

In this part, we provide more visualization results on user-specified class discoveries in Figure 7 . We select some very close text prompts such as "four-leg animal" and "two-leg animal"; "car", "truck" and "SUV" and find that our method can still classify them. We also show another result which is "person identification" in Figure 7 (c) which shows our model preserve the dense/local CLIP features rather well. Table 8 : Ablation Study on Relative Mask Attention Layers in different layers. All the methods are trained on COCO and tested on ADE20K validation dataset. The pretrained CLIP ViT-L/14@336px model has 24 layers and we replace four of them with our relative mask attention to fully utilize the accurate mask information and refine the masks. Different Layers PQ PQ Th PQ St 1, 7, 13, 19 11.241 10.519 12.686 3, 9, 15, 21 11.372 10.141 13.835 21, 22, 23, 24 14.673 14.048 15.922 6, 12, 18, 24 15.121 13.536 18.290 (a) "four-leg animal", "two-leg animal", (b) "car", "truck", "SUV" (c) "Person: Obama", "Person: Biden", "background" "road", "sky" "Person: Trump" 



Figure 1: Illustration of the pipeline. Our pipeline contains two stages. The first stage is a class-agnostic mask proposal network and the second stage is built on the pretrained CLIP ViT model. All the weights of the CLIP ViT model during training are fixed. Arrows in orange denote weight sharing. The embeddings' weights of Mask Class Tokens are shared by Class Tokens in the CLIP ViT model and are fixed. RMA represents Relative Mask Attention which is built based on the CLIP ViT attention layer. RMA contains all the weights from CLIP ViT attention layer which are all fixed during training. Additional weights are added in RMA for further mask information utilization and mask refinement. The demo image we use here is from ADE20K (Zhou et al., 2019).

In our mask attention matrix M, F (N +1)×(N +1) shows the N Image Tokens and one Class Token are attending each other as in the original CLIP. T (N +1)×M shows that the N Image Tokens and one Class Token are not attending the M Mask Class Tokens. M ′ M ×N shows that the Mask Class Tokens are attending the Image Tokens given the corresponding masks. F M ×1 shows that the M Mask Class Tokens are attending the Class Token. T M ×M shows that the M Mask Class Tokens are not interacting with each other.

Figure 2: Relative Mask Attention. Our Relative Mask Attention mechanism adds another attention matrix

MaskCLIP w/o RMA Baseline. Our second baseline is based on the Mask Class Tokens which doesn't use the Relative Mask Attention mechanism. Instead of masking the images and sending the resulted images directly to the CLIP model for feature extraction, we use Mask Class Tokens to acquire the corresponding dense image features. The obtained image features will then be used for further open vocabulary classification. The two baselines above don't need any training in the second stage and can be used to directly perform the open vocabulary tasks. We will demonstrate that the second baseline is better at feature extraction in both quantitative results and qualitative results under the open vocabulary setting and show the effectiveness and efficiency of the proposed Mask Class Tokens.

where L ce is the loss for classification, L dice and L bce are the losses for mask localization. In our experiments, We set λ ce = 2, λ dice = 5, λ bce = 5.In next three parts, we evaluate our methods on open vocabulary panoptic, instance segmentation, and semantic segmentation tasks. The class-agnostic mask proposal networks we use in those methods are trained using Mask2Former other than noted.4.3 OPEN-VOCABULARY PANOPTIC SEGMENTATIONFirst, we compare our MaskCLIP with the two baselines on ADE20K validation set under the open vocabulary panoptic segmentation setting. The results are presented in

Figure 3: Qualitative results on ADE20K panoptic segmentation. The images are taken from the ADE20K validation set. We use the class names directly from the ADE20K 150 classes as the text inpputs. Three images are presented here using our MaskCLIP model along with the two baselines. The second set of images we're presenting are in Figure 4. These figures show our capability of specifying any arbitrary classes in performing panoptic segmentation task. The results show that though we train a new model based on the CLIP model without any distillation methods, we can still preserve the CLIP image features very well. Our model doesn't have a clear bias towards the base classes in the training set and could tell the difference very well that have no chance to learn in the COCO training: e.g toy vs real and filled vs empty.

Figure 4: User-specified class panoptic segmentation. The labels above are the text inputs we used for testing the images. Texts in bold are novel classes i.e. don't exist in the labels of COCO training data. (a) Our model is able to distinguish object properties of real rabbit and toy rabbit. (b) This example shows that our model is potential for fine-grained classifications and does not have bias toward the base classes. (c) Our results show that it can tell the difference between the filled status and empty status of bottles.

Figure 5: Comparison on open-vocabulary semantic segmentation. The input image and the results for GT, ALIGN++, OpenSeg are from(Ghiasi et al., 2022).4.5 OPEN-VOCABULARY INSTANCE SEGMENTATIONCross-Dataset Setting. We present the results on open vocabulary instance segmentation in Table4under the cross-dataset setting. Since instance segmentation can be regarded as "thing-only" panoptic segmentation, we directly apply our model trained on coco panoptic dataset to the instance segmentation task. MaskCLIP with different class-agnostic mask proposal networks perform better than CLIP Baseline and MaskCLIP w/o RMA in general.

Figure 6: Illustration of the CLIP baseline.

Figure 7: More qualitative restuls on user-specified class. The labels above are the text prompts we used for testing the images. Texts in bold are novel classes i.e. don't exist in the labels of COCO training data.

Comparison for recent open-vocabulary approaches for object detection, semantic segmentation,

training data and test on other datasets (ADE20K(Zhou et al., 2019; 2017), PASCAL Context(Mottaghi et al., 2014), LVIS) under the open vocabulary setting. Due to the novel setting of the open vocabulary panoptic segmentation task, we also compare our performance on open vocabulary semantic segmentation with previous methods. Apart from the quantitative results, we also provide qualitative results to validate our method that has good ability to learn dense image features and can support user-specified arbitrary categories. We use the validation set in ADE20K for testing without any training on this dataset in which case we can test our model's capability of open vocabulary segmentation.PASCAL Context: PASCAL Context(Mottaghi et al., 2014) contains 10,103 per-pixel annotations for images of PASCAL VOC 2010(Everingham et al.), where 4998 for training and 5105 for validation. The full version (P-459) includes 459 classes and the short version includes 59 classes. This dataset serves as another benchmark testing our model's open vocabulary segmentation abiltiy. In our first stage, we train a class-agnostic mask proposal network using MaskRCNN(He et al., 2017) and Mask2Former(Cheng et al., 2022) on COCO training data. The experiment setting we use for MaskRCNN is R50-FPN-1x. The backbone we use in Mask2Former is ResNet-50. All the training setting follows the default in their models.

takes ~0.6s which is ~4x faster. Our MaskCLIP beats both baselines significantly as it utilizes the accurate mask information and refines the masks during the feature extraction process.

Results on open-vocabulary panoptic segmentation using the ADE20k validation dataset. th and st represent thing and stuff classes respectively.

Results on open-vocabulary semantic segmentation. A-150 and A-847 represent the ADE20K dataset with 150 classes and 847 classes respectively. P-459 and P-59 represents PASCAL Context dataset with 459 classes and 59 classes respectively. All results use the mIoU metric. All methods presented here don't use extra data other than COCO for training.

Results on open-vocabulary instance segmentation using the ADE20k validation dataset and the LVIS validation dateset under the cross-dataset setting. AP 50 ↑ AP 75 ↑ AP ↑ AP 50 ↑ AP 75 ↑

Results on open-vocabulary instance segmentation under COCO split setting.

Incorporating GT Masks. Results on using GT masks as mask proposals for openvocabulary panoptic segmentation and semantic segmentation.

Ablation Study on Mask Refinement. Results on ADE20K validation set are reported here. Both methods are trained on COCO and tested on ADE20K validation dataset. we provide more details on our CLIP Baseline. Given an RGB image I ∈ R H×W ×3 with height H and width W and a list of category names with C classes, we precompute the text embedding of the category names as E ∈ R C×D . The mask proposal network f m outputs N masks M ∈ R N ×H×W .

D QUALITATIVE RESULTS ON ADE-20K INSTANCE SEGMENTATION

We also show qualitative results on ADE20K comparing the CLIP Baseline, MaskCLIP w/o RMA and MaskCLIP. As shown in Figure 8 , MaskCLIP is much better than the CLIP Baseline and MaskCLIP w/o RMA. Since the mask proposal network is class-agnostic, some masks that are not objects would actually be predicted in which case the classification later will be very important as it may be classified as some object classes in the dataset. The visualization results of MaskCLIP contains fewer non-object masks and are more accurate in class prediction. F EFFICIENCY ANALYSIS.We also provide the FLOPs for differnt models in Table 10 . We use an input resolution of 640x640 here. The CLIP visual encoder we use is ViT-L/14@336px. As we can see, our MaskCLIP w/o RMA and MaskCLIP's FLOPs are much lower than the CLIP Baseline which needs to send each image to the CLIP Visual Encoder. 

