VIEWCO: DISCOVERING TEXT-SUPERVISED SEGMENTATION MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY

Published as a conference paper at ICLR 2023

Abstract

Recently, great success has been made in learning visual representations from text supervision, facilitating the emergence of text-supervised semantic segmentation. However, existing works focus on pixel grouping and cross-modal semantic alignment, while ignoring the correspondence among multiple augmented views of the same image. To overcome such limitation, we propose multi-view consistent learning (ViewCo) for text-supervised semantic segmentation. Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. Additionally, we propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision by contrasting the segment features of Siamese visual encoders. The text-to-views consistency benefits the dense assignment of the visual features by encouraging different crops to align with the same text, while the cross-view segmentation consistency modeling provides additional self-supervision, overcoming the limitation of ambiguous text supervision for segmentation masks. Trained with large-scale image-text data, our model can directly segment objects of arbitrary categories in a zero-shot manner. Extensive experiments show that ViewCo outperforms state-of-the-art methods on average by up to 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively. 1

1. INTRODUCTION

Recently, vision-language contrastive learning (Radford et al. (2021); Li et al. (2021a)) has attracted a lot of attention because it can obtain more generalized feature representations while making use of abundant image-text pairs to avoid labor-intensive annotation costs. Vision-language pre-training (VLP) models have exhibited impressive transferability, but two key challenges remain. First, in vanilla vision-language contrastive learning, each image-text pair is regarded as a unique positive pair, while all the other combinations are regarded as negative ones. This image-text correspondence is actually too rigorous: in fact, one textual description may correspond to different images. The excessive strictness is not conducive to the model learning high-level cross-modal semantic correspondences. Therefore, more relaxed vision-language contrastive learning needs to be considered. Second, the ambiguity of textual descriptions is also a key challenge. Compared with the traditional semantic segmentation pipeline that uses dense annotations as supervision (Touvron et al. (2021); Ren et al. (2022)), CLIP-based segmentation methods (Xu et al. (2022; 2021); Zhou et al. (2021a)) use text as supervision, which is easier to access but noisier and more ambiguous. This is mainly because, compared with traditional segmentation annotations, text descriptions are often more abstract and do not contain location information. Moreover, the background of the image is usually ignored in the description, and in some cases objects in the image do not even appear in the text description (see Figure 1). Such ambiguity is common in the textual supervision of vision-language pre-training. In the semantic segmentation task, the ambiguity of textual supervision makes the segmented object-label correspondence very fragile. Therefore, fully mining the information carried by the dataset itself needs to be considered. On the other hand, visual self-supervision (Caron et al. (2021); He et al. (2022); Chen et al. (2020a); Zhou et al.
(2021b)) has been widely used for visual pre-training. It includes two categories: reconstructing masked images (He et al. (2022); Zhou et al. (2021b)) and multi-crop image contrast (Caron et al. (2021); Chen et al. (2020a)). For example, SLIP (Mu et al. (2021)) introduces contrastive learning of multi-crop visual consistency for VLP, and MaskCLIP (Dong et al. (2022)) introduces a visual self-supervised task of reconstructing masked images. They utilize visual self-supervision to provide more useful information for VLP models. However, the semantic consistency of multiple views of an image, in both segmentation and cross-modal contrast, has not received enough attention. Based on the above observations, in this paper we explore the impact of multi-view semantic consistency on text-supervised semantic segmentation through visual self-supervision. To this end, we propose multi-view consistent learning (ViewCo), which aims at discovering text-supervised segmentation masks via multi-view semantic consistency. Specifically, we propose text-to-views consistency modeling to alleviate the excessive strictness of image-text correspondence in vanilla vision-language contrastive learning. It enables the model to benefit from the dense assignment of visual features by encouraging different crops to align with the same text. This relaxed one-to-many contrast mechanism also facilitates the learning of multi-view consistent semantics, enabling the model to acquire high-level cross-modal alignment capabilities. Moreover, as shown in Figure 1, to alleviate the ambiguity issue of textual supervision, we propose cross-view segmentation consistency modeling. It overcomes the limitation imposed by textual ambiguity by providing additional self-supervision to vision-language contrastive learning via cross-view segmentation consistency.
ViewCo uses the proposed text-to-views consistency modeling for vision-language cross-modal contrastive learning and additionally enables cross-view segmentation consistency modeling by contrasting the segment features of Siamese visual encoders. As shown in Figure 2, with the help of the two consistency modeling schemes, ViewCo establishes a solid semantic correspondence across different views, and the semantics of different views maintain good consistency, whereas the semantic consistency of GroupViT across views is difficult to guarantee. At inference time, we use the similarity scores between the segmentation embeddings generated by the teacher network and the label prompts to assign labels to the image masks for zero-shot semantic segmentation. Compared with the state-of-the-art methods, ViewCo achieves an average improvement of 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively. Our contributions can be summarized as follows:

• We propose a novel one-to-many text-to-views consistency modeling that improves the model's ability of high-level cross-modal semantic alignment by encouraging different crops of an image to align with the same text.

• To alleviate the problem of supervision failure that may arise from text ambiguity, we propose cross-view segmentation consistency modeling to provide additional self-supervision for the vision branch and encourage the model to generate consistent segmentation masks for different views.

• ViewCo consistently outperforms the state-of-the-art methods on PASCAL VOC2012, PASCAL Context, and MS-COCO when pre-trained on CC12M or CC12M+YFCC.
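The zero-shot inference step described above (matching segmentation embeddings against label-prompt embeddings) can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the authors' implementation; the helper name and shapes are ours:

```python
import numpy as np

def assign_mask_labels(seg_emb, label_emb):
    """Assign each of K segmentation embeddings the class whose label-prompt
    embedding is most similar (cosine similarity after l2 normalization).
    seg_emb: (K, d) mask embeddings; label_emb: (C, d) prompt embeddings.
    Returns (K,) class indices."""
    seg_emb = seg_emb / np.linalg.norm(seg_emb, axis=1, keepdims=True)
    label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return (seg_emb @ label_emb.T).argmax(axis=1)
```

In practice the label prompts would be encoded with the text encoder (e.g., a template such as "a photo of a {class}"), and each predicted mask inherits the class of its segment token.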

2. RELATED WORK


3. MULTI-VIEW CONSISTENT LEARNING

As shown in Figure 4 , our ViewCo is mainly composed of a cross-view segmentation consistency module and a text-to-views consistency module. We describe these two modules in Sections 3.1 and 3.2, respectively, and summarize the final loss function in Section 3.3.

3.1. CROSS-VIEW SEGMENTATION CONSISTENCY MODULE

As shown in Figure 4 (left), given a batch of image-text pairs $\{(x^I_i, x^T_i)\}_{i=1}^B$, two random augmentations are performed on the input image $x^I_i$, generating two warped views $u$ and $v$. We use GroupViT (Xu et al. (2022)) as the bottom-up segmentation backbone of ViewCo, where each view is segmented into $K$ segment tokens. For each of the views (e.g., $u$), this process is expressed as:

$$Z^{us}_{Seg} = \{Z^{us}_{seg_k}, k = 1, \ldots, K\} = f_s(u) \in \mathbb{R}^{K \times d},$$

where $Z^{us}_{seg_k} \in \mathbb{R}^d$ is the $k$-th segment feature from the student network $f_s$ and $d$ is the dimensionality of the segment feature. Similarly, we have $Z^{vs}_{Seg}$ from $f_s$, and the segment features $Z^{ut}_{Seg}$ and $Z^{vt}_{Seg}$ from the teacher network $f_t$. We update the parameters of $f_t$ using the exponential moving average (EMA) (He et al. (2020b)) of the parameters of $f_s$. Let $\theta_i$ and $\theta'_i$ be the parameters of $f_s$ and $f_t$ at training step $i$, respectively; then $\theta'_i$ is updated as:

$$\theta'_i = \alpha \theta'_{i-1} + (1 - \alpha)\theta_i,$$

where $\alpha$ is a hyper-parameter for smoothing the update. In addition, the standard contrastive loss, InfoNCE (Oord et al. (2018)), is considered in this paper: for an encoded query $q$ and a set of encoded samples $\{k_0, k_1, k_2, \ldots, k_N\}$ that are the keys of a dictionary, we have:

$$\mathcal{L}_{NCE}(q, k) = -\log \frac{\exp(q \cdot k_+/\tau)}{\sum_{i=0}^{N} \exp(q \cdot k_i/\tau)},$$

where $\tau$ is a learnable temperature parameter, $(q, k_+)$ is the positive pair, and the other $N$ pairs are negative. Intuitively, the segment features obtained from different crops of the same image should be roughly the same, i.e., cross-view segmentation consistency. To this end, for the semantic segmentation task, we replace the image-level contrastive learning in previous methods (Caron et al. (2021); Zhou et al. (2021b)) with cross-view segmentation consistency learning within images. Therefore, we define the minimization training objective of the cross-view segmentation consistency module in ViewCo as:

$$\mathcal{L}^{t \leftrightarrow s}_{[Seg]} = \mathcal{L}^{t \to s}_{[Seg]} + \mathcal{L}^{s \to t}_{[Seg]}.$$
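The EMA update and the InfoNCE loss above can be written compactly. The following NumPy sketch is illustrative only (the function names and hyper-parameter values are ours, not from the paper's codebase):

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """theta'_i = alpha * theta'_{i-1} + (1 - alpha) * theta_i."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

def info_nce(q, keys, pos_idx, tau=0.07):
    """L_NCE(q, k) = -log( exp(q.k+ / tau) / sum_i exp(q.k_i / tau) ).
    q: (d,) query; keys: (N+1, d) dictionary; keys[pos_idx] is the positive."""
    logits = keys @ q / tau
    logits -= logits.max()          # subtract max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[pos_idx]
```

With a query equal to its positive key and orthogonal negatives, the loss is small; pointing `pos_idx` at a mismatched key increases it.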
It is a bi-directional contrastive loss between the segment features of the teacher $f_t$ and the student $f_s$. $\mathcal{L}^{t \to s}_{[Seg]}$ considers the two view pairs $(u_t, v_s)$ and $(v_t, u_s)$ output by $f_t$ and $f_s$. The segment features of $(u_t, v_s)$ from the same image are multiplied ($Z^{ut}_{Seg} \cdot {Z^{vs}_{Seg}}^T \in \mathbb{R}^{K \times K}$) after $\ell_2$ normalization. In the image branch of ViewCo, we use the EMA policy for parameter updates, so the learnable grouping tokens at corresponding position IDs of different views of the same image are highly correlated and carry the same semantics. Therefore, the semantic pairs $\{(Z^{ut}_{seg_i}, Z^{vs}_{seg_i}), i = 1, \ldots, K\}$ on the diagonal are regarded as positive, and the other $K(K-1)$ pairs $\{(Z^{ut}_{seg_i}, Z^{vs}_{seg_j}), i, j = 1, \ldots, K, i \neq j\}$ are regarded as negative. The teacher-to-student contrastive loss is $\mathcal{L}^{t \to s}_{[Seg]} = \mathcal{L}^{ut \to vs} + \mathcal{L}^{vt \to us}$; more specifically:

$$\mathcal{L}^{t \to s}_{[Seg]} = \frac{1}{KB} \sum_{i=1}^{B} \sum_{k=1}^{K} \left( \mathcal{L}_{NCE}(Z^{ut}_{seg_k}, \{Z^{vs}_{seg_j}\}_{j=1}^K) + \mathcal{L}_{NCE}(Z^{vt}_{seg_k}, \{Z^{us}_{seg_j}\}_{j=1}^K) \right).$$

Similarly, the student-to-teacher contrastive loss is $\mathcal{L}^{s \to t}_{[Seg]} = \mathcal{L}^{us \to vt} + \mathcal{L}^{vs \to ut}$; more specifically:

$$\mathcal{L}^{s \to t}_{[Seg]} = \frac{1}{KB} \sum_{i=1}^{B} \sum_{k=1}^{K} \left( \mathcal{L}_{NCE}(Z^{us}_{seg_k}, \{Z^{vt}_{seg_j}\}_{j=1}^K) + \mathcal{L}_{NCE}(Z^{vs}_{seg_k}, \{Z^{ut}_{seg_j}\}_{j=1}^K) \right).$$

Figure 5a shows the positive and negative pairs for cross-view segmentation consistency learning in the vision branch.
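One direction of the cross-view segment contrast above (diagonal positives, off-diagonal negatives over the K×K similarity matrix) might look as follows for a single image; a minimal NumPy sketch under the notation of this section, not the released code:

```python
import numpy as np

def segment_consistency_loss(Z_t, Z_s, tau=0.07):
    """One direction of the cross-view segment loss for one pair of views.
    Z_t, Z_s: (K, d) segment features from teacher and student; the K pairs
    on the diagonal are positives, the K(K-1) off-diagonal pairs negatives."""
    Z_t = Z_t / np.linalg.norm(Z_t, axis=1, keepdims=True)  # l2 normalize
    Z_s = Z_s / np.linalg.norm(Z_s, axis=1, keepdims=True)
    logits = Z_t @ Z_s.T / tau                              # (K, K) similarities
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                      # positives on diagonal
```

When teacher and student segment features at the same position ID agree, the diagonal dominates and the loss is low; permuting the student's tokens raises it.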

3.2. TEXT-TO-VIEWS CONSISTENCY MODULE

Previous methods (Radford et al. (2021); Xu et al. (2022)) build visual-linguistic semantic correspondences by performing a contrastive loss on image-text pairs. In this paper, we consider contrastive learning between multiple views and text, using one-to-many text-to-views consistency modeling instead of one-to-one text-to-image contrastive learning. The model learns to capture intra-modal and inter-modal semantic consistency through the alignment of multi-view images and text. Specifically, for a given image-text pair $(x^I_i, x^T_i)$, by performing two different augmentations on the input image, we obtain a triplet $(u_i, v_i, x^T_i)$. As shown in Figure 4 (right), in the training phase, we take the outputs $(Z^u_i, Z^v_i)$ of the view pair $(u_i, v_i)$ through the student network $f_s$ and the output $Z^T_i$ of the text encoder $E_T$ to calculate the contrastive losses, respectively. The visual embeddings $(Z^u_i, Z^v_i)$ and text embedding $Z^T_i$ are mapped into the same feature space through two MLPs, respectively, before the final $\ell_2$ normalization. This procedure is represented as:

$$Z^{Iu}_i = \mathrm{MLP}(\mathrm{AvgPool}(Z^{u_i}_{[Seg]})), \quad Z^{u_i}_{[Seg]} = f_s(u_i);$$
$$Z^{Iv}_i = \mathrm{MLP}(\mathrm{AvgPool}(Z^{v_i}_{[Seg]})), \quad Z^{v_i}_{[Seg]} = f_s(v_i).$$

The multi-view features $Z^I_i = \{Z^{Iu}_i, Z^{Iv}_i\}$ and text embedding $Z^T_i$ constitute positive pairs, and the other $2B(B-1)$ pairs are negative. The contrastive loss of text-to-views consistency modeling is defined as follows:

$$\mathcal{L}^{I_{\{u,v\}} \leftrightarrow T} = \mathcal{L}^{I_{\{u,v\}} \to T} + \mathcal{L}^{T \to I_{\{u,v\}}}, \tag{5}$$

where the views-to-text contrastive loss is defined as:

$$\mathcal{L}^{I_{\{u,v\}} \to T} = \frac{1}{2B} \sum_{i=1}^{B} \left( \mathcal{L}_{NCE}(Z^{Iu}_i, \{Z^T_j\}_{j=1}^B) + \mathcal{L}_{NCE}(Z^{Iv}_i, \{Z^T_j\}_{j=1}^B) \right),$$

and the text-to-views contrastive loss is defined as:

$$\mathcal{L}^{T \to I_{\{u,v\}}} = \frac{1}{2B} \sum_{i=1}^{B} \left( \mathcal{L}_{NCE}(Z^T_i, \{Z^{Iu}_j\}_{j=1}^B) + \mathcal{L}_{NCE}(Z^T_i, \{Z^{Iv}_j\}_{j=1}^B) \right).$$
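The one-to-many text-to-views contrast can be sketched as below: both view embeddings of image $i$ are treated as positives for text $i$, with the other batch entries as negatives. This is an illustrative NumPy sketch with simplified batching, assuming $\ell_2$-normalized inputs; the function names are ours:

```python
import numpy as np

def text_to_views_loss(Z_u, Z_v, Z_t, tau=0.07):
    """Sketch of the bidirectional text-to-views loss. Z_u, Z_v: (B, d)
    embeddings of the two views; Z_t: (B, d) text embeddings. For each
    direction, diagonal entries of the B x B similarity matrix are positives."""
    def nce(A, C):
        logits = A @ C.T / tau                       # (B, B) similarities
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    # views -> text and text -> views, averaged over the two views
    return 0.5 * (nce(Z_u, Z_t) + nce(Z_v, Z_t)
                  + nce(Z_t, Z_u) + nce(Z_t, Z_v))
```

Both crops of an image pull toward the same caption, which is the relaxed one-to-many correspondence the section describes.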
Additionally, in order to further enhance the association between multi-view semantics and text semantics, we also compute the multi-label image-text contrastive loss (Xu et al. (2022)) of multi-view and "prompted text" pairs $\{(Z^{Iu}_i, \{Z^{t_m}_i\}_{m=1}^M)\}_{i=1}^B$ and $\{(Z^{Iv}_i, \{Z^{t_m}_i\}_{m=1}^M)\}_{i=1}^B$, where $\{Z^{t_m}_i\}_{m=1}^M$ are the embeddings of the additional $M$ text prompts $\{T^m_i\}_{m=1}^M$ generated from the $i$-th text $x^T_i$ by the "prompt engineering" mechanism (Radford et al. (2021)). In $(Z^{Iu}_i, \{Z^{t_m}_i\}_{m=1}^M)$, the embedding of the $i$-th image view $u$ and the $M$ generated text embeddings $\{Z^{t_m}_i\}_{m=1}^M$ form positive pairs, and the other combinations are negative. Therefore, similar to Eq. (5), the multi-label contrastive loss of multi-view $I_{\{u,v\}}$ and multi-prompt $\{T^m\}_{m=1}^M$ is defined as:

$$\mathcal{L}^{I_{\{u,v\}} \leftrightarrow \{T^m\}_{m=1}^M} = \mathcal{L}^{I_{\{u,v\}} \to \{T^m\}_{m=1}^M} + \mathcal{L}^{\{T^m\}_{m=1}^M \to I_{\{u,v\}}}.$$

First, the views-to-prompts loss is the average of the losses of the two views. Considering a single view, e.g., $u$, the contrastive loss of $u$ against all the prompts is defined as:

$$\mathcal{L}^{Iu \to \{T^m\}_{m=1}^M} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\sum_{m=1}^{M} \exp(Z^{Iu}_i \cdot Z^{t_m}_i/\tau)}{\sum_{m=1}^{M} \sum_{j=1}^{B} \exp(Z^{Iu}_i \cdot Z^{t_m}_j/\tau)}.$$

Second, the contrastive loss of multi-prompt-to-views is defined as:

$$\mathcal{L}^{\{T^m\}_{m=1}^M \to I_{\{u,v\}}} = \frac{1}{2MB} \sum_{m=1}^{M} \sum_{i=1}^{B} \left( \mathcal{L}_{NCE}(Z^{t_m}_i, \{Z^{Iu}_j\}_{j=1}^B) + \mathcal{L}_{NCE}(Z^{t_m}_i, \{Z^{Iv}_j\}_{j=1}^B) \right).$$

In particular, a work similar to our text-to-views consistency module is DeCLIP (Li et al. (2021b)). It assumes that the text description may cover only a small part of the image, so in addition to the global view used in CLIP (Radford et al. (2021)), DeCLIP adds a local view for image self-supervision, which may cause information leakage. In addition, DeCLIP uses EDA (Wei & Zou (2019)) as its text augmentation strategy; the augmented text still contains multiple semantics, which does not help the alignment of local semantics in segmentation tasks.
In contrast, ViewCo uses self-supervision of two local views to keep the task challenging, while using the "prompt engineering" mechanism to obtain augmented texts with a single semantic. Combined with one-to-many alignment, this helps ViewCo better mine consistent segmentation semantics in images.
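The views-to-prompts term above (all $M$ prompts of text $i$ positive for view $i$) can be sketched as follows; a NumPy illustration under the section's notation, assuming normalized embeddings (no max-subtraction, for brevity), with names of our own choosing:

```python
import numpy as np

def view_to_prompts_loss(Z_view, Z_prompts, tau=0.07):
    """Multi-label contrast of one view against M prompt embeddings per text.
    Z_view: (B, d); Z_prompts: (B, M, d). The M prompts generated from text i
    are all positives for view i (sum of exp over m in the numerator)."""
    B, M, d = Z_prompts.shape
    sims = np.einsum('bd,jmd->bjm', Z_view, Z_prompts) / tau   # (B, B, M)
    num = np.exp(sims[np.arange(B), np.arange(B), :]).sum(axis=1)   # (B,)
    den = np.exp(sims).reshape(B, -1).sum(axis=1)                   # (B,)
    return -np.mean(np.log(num / den))
```

The numerator pools all prompts of the matching caption, so a view is rewarded for being close to any of the $M$ single-semantic prompts derived from its text.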

3.3. OVERALL LOSS FUNCTION

Finally, the total loss of ViewCo is the sum of the cross-view segmentation consistency contrastive loss and the two cross-modal contrastive losses:

$$\mathcal{L} = \mathcal{L}^{t \leftrightarrow s}_{[Seg]} + \mathcal{L}^{I_{\{u,v\}} \leftrightarrow T} + \mathcal{L}^{I_{\{u,v\}} \leftrightarrow \{T^m\}_{m=1}^M}.$$

4.1. IMPLEMENTATION DETAILS

Architecture. In the cross-view segmentation consistency module, $f_t$ and $f_s$ have the same network structure, and the parameters of $f_t$ are updated using the exponential moving average of the parameters of $f_s$.

For example, in image x, "umbrella" is misclassified as "cow", and in view u, "umbrella" is misclassified as "horse". There is also the problem of inconsistent semantic segmentations between views u and v. As shown in Figure 2 (b), the semantic segmentations across different views in ViewCo are completely consistent. This shows that the cross-view segmentation consistency modeling and text-to-views consistency modeling in ViewCo are effective.


To evaluate ViewCo's ability to perform semantic segmentation through semantic understanding of rare scenes, we show more visual comparisons in Figure 6. The images of rare scenes are selected from the Internet.

Image Classification. We also evaluate the classification performance of ViewCo. As shown in Table 5, ViewCo significantly outperforms ViT (i.e., 46.3% vs. 42.4%) and GroupViT (i.e., 46.3% vs. 42.9%), showing that ViewCo achieves better cross-modal semantic alignment through text-to-views consistency modeling.

5. CONCLUSION

We propose novel and simple multi-view consistent learning (ViewCo) for text-supervised semantic segmentation. To deal with the problems of excessively strict image-text correspondence and ambiguous text supervision in the VLP model, ViewCo models text-to-views consistency and cross-view segmentation consistency. ViewCo can generate consistent segmentations and better capture high-level cross-modal semantic alignment. We expect that this exploration of multi-view consistent learning will also be applicable to other VLP tasks.



1 The latest version of GroupViT reports the results of training on the CC3M+CC12M+YFCC dataset.
2 MindSpore: https://www.mindspore.cn



Figure 1: Illustration of text description ambiguity. Text descriptions are highly abstract and difficult to align semantically with images. Cross-view semantic consistency modeling can effectively alleviate the effect of the text description ambiguity issue.

Figure 2: Comparison of the consistency of semantic segmentation results across multiple views of a "horse". (a) GroupViT: the semantic segmentations of different views are inconsistent. (b) ViewCo: the semantic segmentations of different views are much more consistent. Here, x, u, and v denote the segmentation results on the original image and on views u and v, respectively.

Vision-Language Pre-training. In recent years, vision-language pre-training models (Chen et al. (2020b); Desai & Johnson (2021); Li et al. (2020a; 2021a; 2020b)) have developed rapidly with the help of large-scale image-text pair data available on the Internet. Recently, VLP models such as CLIP (Radford et al. (2021)), ALIGN (Li et al. (2021a)), and SLIP (Mu et al. (2021)) have made great progress in visual representation learning by using contrastive learning, and they have been successfully transferred to various downstream tasks, such as visual question answering (Antol et al. (2015); Zhou et al. (2020)) and visual reasoning (Zellers et al. (2019)). In particular, CLIP (Radford et al. (2021)) uses the image-text matching relationship for contrastive learning, and the learned model can be directly transferred to ImageNet classification (Deng et al. (2009)) in a zero-shot manner without any fine-tuning. This success also extends to zero-shot semantic segmentation (Xu et al. (2022)). However, the one-to-one contrastive learning mechanism between image and text in the vanilla VLP pipeline is too strict, which is not conducive to the model learning high-level cross-modal semantic alignment. Based on the above observations, this paper proposes one-to-many text-to-views consistency modeling. It relaxes the original one-to-one correspondence by encouraging different crops of an image to match the same text, allowing the model to benefit from the dense assignment of the visual features.

Visual Self-Supervision. This framework relies on the information carried by the image itself for self-supervision, without any additional annotations. Visual self-supervision is mainly divided into generative (He et al. (2022); Bao et al. (2021)) and contrastive (He et al. (2020a); Caron et al. (2021); Chen et al. (2020a)) approaches. A generative model learns the feature representation of the image by reconstructing the masked image.
Contrastive models focus more on learning global representations. Since semantic segmentation requires dense prediction of images, generative models may not help much.

Figure 3: Comparison of single-view text-to-image contrastive learning (left) and multi-view text-to-views contrastive learning (right).

Figure 4: Framework of ViewCo. It is mainly composed of a cross-view segmentation consistency module and a text-to-views consistency module. The visual branch adopts a visual self-supervised model, which consists of teacher $f_t$ and student $f_s$ networks with the same structure. $f_t$ and $f_s$ are the bottom-up segmentation backbones that output segment features of the image.

Learning consistent semantics across multiple views is also the core idea of visual self-supervised contrastive learning (Caron et al. (2021); Chen et al. (2020a); He et al. (2020a)). For example, DenseCL (Wang et al. (2021a)) performs pixel-level dense contrastive learning on dense output vectors from multiple views, which does not help the learning of high-level global semantic information. Further, GroupViT (Xu et al. (2022)) uses text as supervision and achieves pixel grouping by capturing the contextually consistent semantics of images. However, in the text-supervised semantic segmentation task, the ambiguity of text relative to dense annotations means that the semantic consistency of images sharing the same semantics cannot be sufficiently guaranteed in the embedding space. Furthermore, the strict one-to-one correspondence between image and text in the vanilla VLP model is also not conducive to the true alignment of high-level cross-modal semantics. Figure 3 (left) illustrates this observation: although one of the views of an image (e.g., the solid circle) is already close to the corresponding text embedding, other views (e.g., the dashed circles) may still be far away. Previous VLP methods generally only focus on the alignment of a single view with text. In contrast, as shown in Figure 3 (right), ViewCo focuses on text-to-views consistency modeling, performing one-to-many matching in cross-modal contrastive learning.

Figure 5: Illustration of the contrastive losses of (a) cross-view segmentation consistency modeling and (b) text-to-views consistency modeling. $Z^{I_i}_{seg_k}$ is the $k$-th semantic feature of the $i$-th image (i.e., view $u$ or $v$). $Z^{Iv}_i$ and $Z^{Iu}_i$ are the embeddings of views $v$ and $u$ of the $i$-th image, respectively.

Figure 6: Comparison of semantic segmentation of images in rare scenes. (a) Image segmentation and semantic prediction. (b) Image segmentation. ViewCo can better learn high-level cross-modal semantic alignment with the help of two consistency modeling schemes.

In Figure 6(a), we use the class labels of the PASCAL VOC 2012 dataset as the label set for the images. ViewCo's segmentation and prediction results in rare scenes are significantly better than GroupViT's, indicating that ViewCo can better understand high-level semantics in images through consistent semantic learning. In Figure 6(b), we focus only on the model's ability to segment images in rare scenes; compared to GroupViT, ViewCo handles the details of image segmentation much better. More visual comparison results are shown in Figure 7 in A.2 of the supplementary material. In addition, we also visually compare the segmentation consistency of ViewCo and GroupViT on different views in A.3. Finally, we present an analysis of ViewCo's cross-view segmentation consistency in A.4.

Table 5: Zero-shot performance on ImageNet.

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, the National Natural Science Foundation of China (NSFC) under Grant No. 61976233, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), and the Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365). We thank MindSpore and the CAAI-Huawei MindSpore Open Fund for the partial support of this work; MindSpore is a new deep learning computing framework 2.

4.2. COMPARISONS WITH RECENT METHODS

We first compare the performance of ViewCo with several ViT-S-based zero-shot baselines. Then, to further evaluate ViewCo on the zero-shot semantic segmentation task, we compare it with several fully supervised transfer and CLIP-based models.

Image-Level Contrast vs. Semantic-Level Contrast. To ablate the role of the cross-view segmentation consistency module in the vision branch, we add an image-level contrastive module to GroupViT in the visual branch, where we first average the K segment tokens output by the teacher and student networks and then perform contrastive learning. For ViewCo, we remove the text-to-views consistency module and directly average-pool the multi-view features output by the student network. To be consistent with GroupViT, we use the pooled visual features for contrastive learning with the text embeddings. As shown in Table 3, adding a visual self-supervised module to vision-language contrastive learning can improve performance on semantic segmentation by improving the quality of visual feature learning. Furthermore, the improvement of semantic-level learning over image-level contrastive learning (i.e., 19.1 vs. 18.6) suggests that the cross-view segmentation consistency module can further improve performance by capturing the consistency of cross-view semantic segmentation.

Vision-Language Contrast: Text-to-Image vs. Text-to-Views. We further ablate the text-to-views consistency module in ViewCo. In single-view vision-language contrastive learning, we use the average embedding of the multi-view features output by the student network and the text embedding for contrastive learning during training. As shown in Table 4, text-to-views consistency modeling significantly improves model performance compared to single-view text-to-image contrast (i.e., by 1.1% and 1.5%).
This indicates that text-to-views consistency modeling has better high-level semantic alignment capability than single-view text-to-image modeling, which previous single-view vision-language contrastive methods lack.

Qualitative Analysis. Figure 2 shows visualization results of multi-view semantic segmentation consistency for ViewCo and GroupViT. As shown in Figure 2 (a), in GroupViT, the semantic segmentations of different views from the same image are inconsistent.

