PSEUDOSEG: DESIGNING PSEUDO LABELS FOR SEMANTIC SEGMENTATION

Abstract

Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much higher labeling costs, and thus benefit greatly from data-efficient training methods. However, the structured outputs in segmentation make it difficult to apply existing SSL strategies directly (e.g., in designing pseudo-labeling and augmentation). To address this problem, we present a simple and novel re-design of pseudo-labeling that generates well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is network-architecture agnostic and applies within a one-stage consistency training framework. We demonstrate the effectiveness of the proposed strategy in both low-data and high-data regimes. Extensive experiments validate that pseudo labels generated by wisely fusing diverse sources, combined with strong data augmentation, are crucial to consistency training for semantic segmentation.

1. INTRODUCTION

Image semantic segmentation is a core computer vision task that has been studied for decades. Compared with other vision tasks, such as image classification and object detection, human annotation of pixel-accurate segmentation is dramatically more expensive. Given sufficient pixel-level labeled training data (i.e., the high-data regime), current state-of-the-art segmentation models (e.g., DeepLabv3+ (Chen et al., 2018)) produce segmentation predictions satisfactory for common practical usage. Recent work demonstrates further improvement in high-data regime settings with large-scale data, including self-training (Chen et al., 2020a; Zoph et al., 2020) and backbone pre-training (Zhang et al., 2020a). In contrast, the performance of segmentation models drops significantly given very limited pixel-labeled data (i.e., the low-data regime). Such ineffectiveness in the low-data regime hinders the applicability of segmentation models. Therefore, instead of improving high-data regime segmentation, our work focuses on data-efficient segmentation training that relies on only a few pixel-labeled examples and leverages extra unlabeled or weakly annotated (e.g., image-level labeled) data to improve performance, with the aim of narrowing the gap to supervised models trained with fully pixel-labeled data. Our work is inspired by the recent success of semi-supervised learning (SSL) for image classification, which demonstrates promising performance given very limited labeled data and a sufficient amount of unlabeled data. Successful examples include MeanTeacher (Tarvainen & Valpola, 2017), UDA (Xie et al., 2019), MixMatch (Berthelot et al., 2019b), FeatMatch (Kuo et al., 2020), and FixMatch (Sohn et al., 2020a). One outstanding idea in this line of SSL is consistency training: making predictions consistent among multiple augmented versions of the same image.
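This consistency-training idea, in the confidence-thresholded pseudo-labeling form popularized by FixMatch, can be sketched with a toy classifier. Everything below (`model`, `weak_aug`, `strong_aug`, the 0.9 threshold) is an illustrative stand-in, not the papers' actual implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, w):
    """Toy linear classifier with softmax output (stand-in for a real network)."""
    logits = x @ w
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weak_aug(x):
    """Mild perturbation (stand-in for flip/crop)."""
    return x + 0.01 * rng.standard_normal(x.shape)

def strong_aug(x):
    """Heavy perturbation (stand-in for RandAugment/CutOut-style augmentation)."""
    return x + 0.5 * rng.standard_normal(x.shape)

def consistency_loss(x_unlabeled, w, threshold=0.9):
    """One-hot pseudo labels from the weak view supervise the strong view,
    kept only where the weak-view prediction is confident."""
    p_weak = model(weak_aug(x_unlabeled), w)
    conf = p_weak.max(axis=-1)
    pseudo = p_weak.argmax(axis=-1)          # hard (one-hot) pseudo label
    mask = conf >= threshold                 # confident predictions only
    p_strong = model(strong_aug(x_unlabeled), w)
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-8)
    return (mask * ce).sum() / max(mask.sum(), 1)
```

In a real SSL setup this term is added to the supervised loss on the labeled batch; the threshold trades pseudo-label coverage against pseudo-label quality.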
FixMatch (Sohn et al., 2020a) shows that using high-confidence one-hot pseudo labels, obtained from weakly-augmented unlabeled data, to train the strongly-augmented counterpart is the key to the success of SSL in image classification. However, effective pseudo labels and well-designed data augmentation are non-trivial to obtain for semantic segmentation. Although many related works explore the second ingredient (i.e., augmentation) for image segmentation to enable a consistency training framework (French et al., 2020; Ouali et al., 2020), we show that a wise design of pseudo labels for segmentation holds great untapped potential. In this paper, we propose PseudoSeg, a one-stage training framework that improves image semantic segmentation by leveraging additional data either with image-level labels (weakly-labeled data) or without any labels. PseudoSeg presents a novel design of pseudo-labeling to infer effective structured pseudo labels for the additional data. It then optimizes the prediction of strongly-augmented data to match the corresponding pseudo labels. In summary, we make the following contributions:

• We propose a simple one-stage framework to improve semantic segmentation using a limited amount of pixel-labeled data and sufficient unlabeled or image-level labeled data. Our framework is simple to apply and therefore network-architecture agnostic.

• Directly applying consistency training approaches validated in image classification renders particular challenges in segmentation. We demonstrate how well-calibrated soft pseudo labels, obtained through a wise fusion of predictions from diverse sources, can greatly improve consistency training for segmentation.

• We conduct extensive experimental studies on the PASCAL VOC 2012 and COCO datasets. Comprehensive analyses validate the effectiveness of the method not only in the low-data regime but also in the high-data regime. Our experiments study multiple important open questions about transferring SSL advances to segmentation tasks.

2. RELATED WORK

Semi-supervised classification. Semi-supervised learning (SSL) aims to improve model performance by incorporating a large amount of unlabeled data during training. Consistency regularization and entropy minimization are two common strategies for SSL. The intuition behind consistency-based approaches (Laine & Aila, 2016; Sajjadi et al., 2016; Miyato et al., 2018; Tarvainen & Valpola, 2017) is that the model output should remain unchanged when the input is perturbed. The entropy minimization strategy (Grandvalet & Bengio, 2005), on the other hand, argues that unlabeled data can be used to ensure classes are well-separated, which can be achieved by encouraging the model to output low-entropy predictions. Pseudo-labeling (Lee, 2013) is one method for implicit entropy minimization. Recently, holistic approaches (Berthelot et al., 2019b;a; Sohn et al., 2020a) combining both strategies have been proposed and have achieved significant improvement. By re-designing the pseudo label, we propose an efficient one-stage semi-supervised consistency training framework for semantic segmentation.

Semi-supervised semantic segmentation. Collecting pixel-level annotations for semantic segmentation is costly and prone to error. Hence, leveraging unlabeled data in semantic segmentation is a natural fit. Early methods utilize a GAN-based model either to generate additional training data (Souly et al., 2017) or to learn a discriminator between the prediction and the ground-truth mask (Hung et al., 2018; Mittal et al., 2019). Consistency-regularization-based approaches have also been proposed recently, enforcing predictions to be consistent across augmented input images (French et al., 2020; Kim et al., 2020), perturbed feature embeddings (Ouali et al., 2020), or different networks (Ke et al., 2020). Recently, Luo & Yang (2020) propose a dual-branch training network to jointly learn from pixel-accurate and coarsely labeled data, achieving good segmentation performance. To push state-of-the-art performance, iterative self-training approaches (Chen et al., 2020a; Zoph et al., 2020; Zhu et al., 2020) have been proposed. These methods usually assume the available labeled data is enough to train a good teacher model, which is then used to generate pseudo labels for the student model. However, this condition might not hold in the low-data regime. Our proposed method, on the other hand, realizes the ideas of both consistency regularization and pseudo-labeling in segmentation, and consistently improves the supervised baseline in both low-data and high-data regimes.

Weakly-supervised semantic segmentation. Instead of supervising network training with accurate pixel-level labels, many prior works exploit weaker forms of annotation (e.g., bounding boxes (Dai et al., 2015), scribbles (Lin et al., 2016), or image-level labels). Most recent approaches use image-level labels as the supervisory signal, exploiting the idea of the class activation map (CAM) (Zhou et al., 2016). Since the vanilla CAM only focuses on the most discriminative region of objects, dif-
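The class activation map underlying these weakly-supervised approaches can be sketched in a few lines; the array shapes, the ReLU, and the max-normalization below are common illustrative choices, not a specific paper's implementation:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Vanilla CAM (Zhou et al., 2016): a weighted sum of the final conv
    feature maps, using the classifier weights of the chosen class.

    features:   (C, H, W) activations from the last convolutional layer
    fc_weights: (num_classes, C) weights of the global-average-pool classifier
    """
    # Contract the channel axis: sum_c w[class_idx, c] * features[c] -> (H, W)
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)
    cam = np.maximum(cam, 0.0)        # keep positive class evidence only
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for thresholding
    return cam
```

Thresholding the normalized map yields the coarse object localization that weakly-supervised segmentation methods refine.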

Code availability: https://github.com/googleinterns/wss

