VIEWCO: DISCOVERING TEXT-SUPERVISED SEGMENTATION MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY

Abstract

Recently, great success has been achieved in learning visual representations from text supervision, facilitating the emergence of text-supervised semantic segmentation. However, existing works focus on pixel grouping and cross-modal semantic alignment while ignoring the correspondence among multiple augmented views of the same image. To overcome this limitation, we propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation. Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. Additionally, we propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision by contrasting the segment features of Siamese visual encoders. The text-to-views consistency benefits the dense assignment of the visual features by encouraging different crops to align with the same text, while the cross-view segmentation consistency modeling provides additional self-supervision, overcoming the limitation of ambiguous text supervision for segmentation masks. Trained with large-scale image-text data, our model can directly segment objects of arbitrary categories in a zero-shot manner. Extensive experiments show that ViewCo outperforms state-of-the-art methods on average by up to 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively.



pair is regarded as a unique positive pair, while all the other combinations are regarded as negative ones. This image-text correspondence is actually too rigid: in practice, one textual description may correspond to different images. Such excessive strictness is not conducive to learning high-level cross-modal semantic correspondences, so a more relaxed form of vision-language contrastive learning needs to be considered. Second, the ambiguity of textual descriptions is also a key challenge. Compared with the traditional semantic segmentation pipeline that uses dense annotations as supervision (Touvron et al. (2021)), text supervision is far weaker. This is mainly because, compared with traditional segmentation annotations, text descriptions are often more abstract and do not contain location information. Moreover, the background in the image is usually ignored in the description. In some cases, objects in the image do not even appear in the text description (see Figure 1). Such ambiguity is common in the textual supervision used for vision-language pre-training. In the semantic segmentation task, the ambiguity of textual supervision makes the segmented object-label correspondence very fragile. Therefore, fully mining the information carried by the dataset itself needs to be considered. However, the semantic consistency of multiple views of an image in segmentation and cross-modal contrast has not received enough attention and research.

Based on the above observations, in this paper, we explore the impact of multi-view semantic consistency on text-supervised semantic segmentation through visual self-supervision. To this end, we propose multi-View Consistent learning (ViewCo), which aims at discovering text-supervised segmentation masks via multi-view semantic consistency.
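The rigidity criticized above can be made concrete with the vanilla image-text contrastive objective used by CLIP-style pre-training, where only the diagonal of the batch similarity matrix is treated as positive. The following is a minimal numpy sketch; the function name, batch shapes, and temperature value are illustrative assumptions, not taken from any specific implementation.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Vanilla one-to-one image-text contrastive (InfoNCE) loss sketch.

    Row i of img_emb and row i of txt_emb form the only positive pair;
    every other pairing in the batch is treated as a negative. This is
    the strict correspondence the text argues is too rigid.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    # Cross-entropy with the diagonal as the ground-truth matches
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because every off-diagonal pair is forced apart, a caption that would also describe another image in the batch is still pushed away from it, which is exactly the over-strictness the relaxed one-to-many scheme below is meant to address.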
Specifically, we propose text-to-views consistency modeling to alleviate the excessive strictness of image-text correspondence in vanilla vision-language contrastive learning. It enables the model to benefit from the dense assignment of visual features by encouraging different crops to align with the same text. This relaxed one-to-many contrast mechanism also facilitates the learning of multi-view consistent semantics, enabling the model to acquire high-level cross-modal alignment capabilities. Moreover, as shown in Figure 1, to alleviate the ambiguity issue of textual supervision, we propose cross-view segmentation consistency modeling, which overcomes the limitation imposed by textual ambiguity by providing additional self-supervision to vision-language contrastive learning. ViewCo uses the proposed text-to-views consistency modeling for vision-language cross-modal contrastive learning and additionally enables cross-view segmentation consistency modeling by contrasting the segment features of Siamese visual encoders. As shown in Figure 2, with the help of the two consistency modeling schemes, ViewCo establishes a solid semantic correspondence across different views, and the semantics in different views remain highly consistent, whereas the semantic consistency of GroupViT across different views is difficult to guarantee. Overall, ViewCo's design is simple and effective. We train it on the large-scale image-text datasets CC12M (Changpinyo et al. (2021)) and YFCC (Thomee et al. (2016)). In the inference stage, we directly segment objects of arbitrary categories in a zero-shot manner.
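The two consistency schemes can be sketched as two loss terms: a relaxed one-to-many contrast in which every augmented view of image i is a positive for text i, and a self-supervised term pulling together the segment features produced by the two Siamese branches. This is a minimal numpy sketch under stated assumptions; the function names, the per-view averaging, and the cosine-based segment term are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def text_to_views_loss(view_embs, txt_emb, temperature=0.07):
    """Relaxed one-to-many contrast (sketch): each augmented view of
    image i is treated as a positive for text i, instead of a single
    unique image-text positive pair."""
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    losses = []
    for v in view_embs:  # each v: (B, D), one embedding per image per view
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        logits = v @ txt.T / temperature  # (B, B)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        losses.append(-np.mean(np.diag(log_probs)))
    return float(np.mean(losses))

def cross_view_segment_consistency(seg_u, seg_v):
    """Cross-view self-supervision (sketch): pull corresponding segment
    features of the two Siamese branches together, here measured as
    1 - mean cosine similarity (0 when the views agree perfectly)."""
    u = seg_u / np.linalg.norm(seg_u, axis=1, keepdims=True)
    v = seg_v / np.linalg.norm(seg_v, axis=1, keepdims=True)
    return float(1.0 - np.mean(np.sum(u * v, axis=1)))
```

In a full training loop the two terms would be weighted and summed with the cross-modal loss; how segments are matched across views (and any stop-gradient on one branch) is an implementation detail this sketch leaves out.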



Figure 1: Illustration of text description ambiguity. Text descriptions are highly abstract and difficult to align semantically with images. Cross-view semantic consistency modeling can effectively alleviate the effect of the text description ambiguity issue.

Figure 2: Comparison of the consistency of semantic segmentation results across multiple views of a "horse". (a) GroupViT: the semantic segmentations of different views are inconsistent. (b) ViewCo: the semantic segmentations of different views are much more consistent. Here, x, u, and v denote the segmentation results on the original image and on views u and v, respectively.

Compared with segmentation methods that use dense annotations as supervision (Touvron et al. (2021); Ren et al. (2022)), the CLIP-based segmentation methods (Xu et al. (2022; 2021); Zhou et al. (2021a)) use text as supervision, which is easier to access but noisier and more ambiguous.

On the other hand, visual self-supervision (Caron et al. (2021); He et al. (2022); Chen et al. (2020a); Zhou et al. (2021b)) has been widely used for visual pre-training. It includes two categories: reconstructing masked images (He et al. (2022); Zhou et al. (2021b)) and multi-crop image contrast (Caron et al. (2021); Chen et al. (2020a)). For example, SLIP (Mu et al. (2021)) introduces contrastive learning of multi-crop visual consistency for VLP. MaskCLIP (Dong et al. (2022)) introduces a visual self-supervised task of reconstructing masked images, utilizing visual self-supervision to provide more useful information for VLP models.

