DISENTANGLED FEATURE SWAPPING AUGMENTATION FOR WEAKLY SUPERVISED SEMANTIC SEGMENTATION

Abstract

Weakly supervised semantic segmentation uses localization maps obtained from a classifier to generate pseudo-masks. However, classifiers exploit background cues to predict class labels because of biased datasets in which specific objects frequently co-occur with certain backgrounds. Consequently, the classifier confuses the background with the target objects, resulting in inaccurate localization maps. To address this, we propose a disentangled feature swapping augmentation method that makes the classifier focus on the target objects rather than on the background. Our method first disentangles the foreground and background features. We then randomly swap the disentangled features within mini-batches via a two-way process. The swapped features contain various contexts that do not appear in the biased dataset, while the class-relevant representation is preserved. In addition, we introduce training schemes to obtain further performance gains. Experimental results show that when our augmentation method is applied to various weakly supervised semantic segmentation methods trained on the PASCAL VOC 2012 dataset, the quality of the localization maps and pseudo-masks, as well as the final segmentation results, improves.

1. INTRODUCTION

Semantic segmentation is a task that classifies the objects in an image at the pixel level. A large number of pixel-level labels are required to train a semantic segmentation network; however, acquiring such labels is costly and time consuming. To alleviate this problem, research on weakly supervised semantic segmentation (WSSS) is being actively conducted. WSSS uses weak labels that contain less information about object locations than pixel-level labels but are much cheaper to annotate. Examples of such weaker forms of labels are image-level class labels (Lee et al., 2021a;b; 2022b), bounding boxes (Khoreva et al., 2017; Lee et al., 2021c; Song et al., 2019), points (Bearman et al., 2016; Kim et al., 2022), and scribbles (Tang et al., 2018; Lin et al., 2016). Among these weak labels, we focus on the image-level class label, which is the most accessible and has the lowest annotation cost. Most WSSS research using image-level class labels generates pseudo-masks from localization maps produced by a class activation map (CAM). Therefore, the performance of the segmentation network trained on these pseudo-masks depends on the quality of the CAM. However, a classifier trained only with class labels confuses the target object with the background, which in turn produces a blurry CAM (Lin et al., 2016). This is caused by biased datasets composed of images in which the target object frequently co-occurs with a certain background context (Geirhos et al., 2020). For instance, objects of the "sheep" class always appear on a "grass landscape," and the visual layout is similar across many images: among the PASCAL VOC 2012 training images, more than 20% of those containing the "sheep" class have "grass landscape" as the context. The same holds for the cow-grass landscape, boat-water, and aeroplane-sky pairs (Lee et al., 2022a).
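For reference, the CAM mentioned above is typically computed as a classifier-weight-weighted sum of the final convolutional feature maps. A minimal sketch follows; the array shapes and function name are illustrative, not tied to any particular implementation:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM by weighting the final conv feature maps
    with the linear classifier weights of the target class.

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, C) weights of the linear classifier
    class_idx:  index of the target class
    """
    # Weighted sum over the channel dimension -> (H, W) map
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0)          # keep positive class evidence only
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Toy example: 4 channels, a 3x3 spatial grid, 2 classes
feats = np.random.rand(4, 3, 3)
w = np.random.rand(2, 4)
cam = class_activation_map(feats, w, class_idx=0)
```

In practice, the resulting map is upsampled to the input resolution and thresholded to obtain the localization map from which pseudo-masks are derived.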
Therefore, a classifier trained with a biased dataset depends not only on the target object but also on biases such as the background context. Because of such misleading correlations, the classifier often assigns high scores to background regions adjacent to objects, or fails to activate the target object region when the object appears outside of its typical scene. To mitigate this shortcoming, a data augmentation method is required that prevents the classifier from overfitting to these misleading correlations. In this paper, we propose a DisEntangled FeaTure swapping augmentation method (hereafter referred to as DEFT) to alleviate the classifier's bias toward misleading correlations between the target object and the background. First, we disentangle the feature representations of the target object and the background, as these features are highly entangled with each other, causing the classifier to confuse the two cues. To this end, we aggregate information about the target object and the background, and use this information together with explicitly defined labels to train the classifier. Then, based on the disentangled representation, we randomly swap the background representations within each mini-batch while the foreground representations are fixed, and vice versa. The swapped representations are augmented with diverse contextual information, and the classifier can focus more on the object itself because the dependency between the object and a specific background is broken. A classifier trained on these augmented representations effectively suppresses scores on background regions and in turn yields high-quality CAMs. The main contributions of this work can be summarized as follows:

• We propose DEFT to alleviate the problem that classifiers trained with classical augmentation methods suffer from spurious correlations between target objects and backgrounds.
• Our proposed DEFT method operates in the feature space, does not require any heuristic decisions or re-training of the network, and can easily be added to other WSSS methods.

• When DEFT was applied to other WSSS methods, the quality of the localization maps and pseudo-masks generated by the classifier increased, and the performance of the segmentation network on PASCAL VOC 2012 also improved.

2. RELATED WORK

Weakly Supervised Semantic Segmentation WSSS methods that use image-level class labels generate a localization map from an initial CAM seed and then produce a pseudo-mask through an additional refinement process. Because the initial seed identifies only the most discriminative regions in the image, numerous studies have aimed to expand such regions. AdvCAM (Lee et al., 2021b) identifies more object regions by manipulating the attribution map through adversarial climbing of the class scores. DRS (Kim et al., 2021a) suppresses the most discriminative region, thereby enabling the classifier to capture even non-discriminative regions. SEAM (Wang et al., 2020) regularizes the classifier so that differently transformed localization maps remain consistent. AMN (Lee et al., 2022b) leverages less discriminative parts through per-pixel classification. Furthermore, several studies have sought to prevent the classifier from learning misleading correlations between the target object and the background. SIPE (Chen et al., 2022) captures the object more accurately through prototype modeling of the background. ICD (Fan et al., 2020a) introduces an intra-class discriminator that separates the foreground and background within the same class. W-OoD (Lee et al., 2022a) utilizes out-of-distribution data as extra supervision to train the classifier to suppress spurious cues. In addition, various studies have employed a saliency map as additional supervision or for post-processing (Lee et al. (2021e); Fan et al. (2020b); Lee et al. (2019); Wei et al. (2017; 2018); Yao & Gong (2020)). Our proposed method disentangles the background information in the feature space, and thus no additional supervision is required.

Data Augmentation Data augmentation aims to improve the generalization ability of a classifier on unseen data by increasing the diversity of the training data. The image erasing method removes one or more sub-regions of an image and replaces them with zero or random values: Cutout (DeVries & Taylor, 2017) randomly masks a specific part of the image, and Hide-and-Seek (Singh et al., 2018) forces the classifier to seek class-relevant features after randomly hiding patches of the image. In contrast, image mix-based methods mix two or more images: Mixup (Zhang et al., 2017) interpolates two images and their labels, and CutMix (Yun et al., 2019) replaces a region of one image with a patch from another. However, because these regional-patch methods randomly occlude sub-regions covering both object and background areas, the classifier
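In contrast to these pixel-space operations, the swapping described in the introduction acts on disentangled features. The core step can be sketched as follows; the (B, C) shapes, the function name, and the additive recombination of foreground and background parts are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def feature_swap(fg, bg, rng):
    """Two-way feature swapping within a mini-batch.

    fg, bg: (B, C) disentangled foreground / background features.
    Returns two augmented batches: one pairs each sample's foreground
    with another sample's background, the other does the reverse.
    NOTE: recombining the two parts by addition is an assumption made
    for illustration; the actual recombination may differ.
    """
    perm = rng.permutation(fg.shape[0])  # random pairing within the batch
    swap_bg = fg + bg[perm]              # foreground fixed, background swapped
    swap_fg = fg[perm] + bg              # background fixed, foreground swapped
    return swap_bg, swap_fg

# Toy example: a batch of 4 samples with 8-dimensional features
rng = np.random.default_rng(0)
fg = rng.standard_normal((4, 8))
bg = rng.standard_normal((4, 8))
a, b = feature_swap(fg, bg, rng)
```

Because the foreground part of each augmented feature is unchanged, the original class label can be reused for the swapped-background batch, which is what lets the classifier see an object against contexts absent from the biased dataset.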

