SEMANTIC PRIOR FOR WEAKLY SUPERVISED CLASS-INCREMENTAL SEGMENTATION

Abstract

Class-incremental semantic image segmentation assumes multiple model updates, each enriching the model to segment new categories. This is typically carried out by providing pixel-level manual annotations for all new objects, which limits the adoption of such methods in practice. Approaches that solely require image-level labels offer an attractive alternative, yet such annotations lack crucial information about the location and boundary of new objects. In this paper we argue that, since classes represent not just indices but semantic entities, the conceptual relationships between them can provide valuable information that should be leveraged. We propose a weakly supervised approach that exploits such semantic relations to transfer cues from previously learned classes into the new ones, complementing the supervisory signal from image-level labels. We validate our approach on a number of continual learning tasks and show that even a simple pairwise interaction between classes can significantly improve the segmentation mask quality of both old and new classes. We show that these conclusions hold for longer and, hence, more realistic sequences of tasks, as well as for a challenging few-shot scenario.

1. INTRODUCTION

When working towards the real-world deployment of artificial intelligence systems, two main challenges arise: such systems should possess the ability to continuously learn, and this learning process should require only limited human intervention. While deep learning models have proved effective in tackling tasks for which large amounts of curated data as well as abundant computational resources are available, they still struggle to learn over continuous and potentially heterogeneous sequences of tasks, especially if supervision is limited. In this work, we focus on the task of semantic image segmentation (SIS). A reliable and versatile SIS model should be able to seamlessly add new categories to its repertoire without forgetting the old ones. Considering for instance a house robot or a self-driving vehicle with such segmentation capability, we would like it to handle new classes without having to retrain the segmentation model from scratch. Such ability is at the core of continual learning research, the main challenge being to mitigate catastrophic forgetting of what has been previously learned (Parisi et al., 2019).

Figure 1: Our proposed Relation-aware Semantic Prior (RaSP) loss is based on the intuition that predictions from existing classes provide valuable cues to better segment new, semantically related classes. This allows reducing supervision to image-level labels for incremental SIS.

Most learning algorithms for SIS assume training samples with accurate pixel-level annotations, a time-consuming and tedious operation. We argue that this is cumbersome and severely hinders continual learning; adding new classes over time should be a lighter-weight process. This is why, here, we focus on the case where only image-level labels are provided (e.g., adding the 'sheep' class comes as easily as providing images guaranteed to contain at least one sheep).
This weakly supervised task is an extremely challenging problem in itself, and very few attempts have been made in the context of continual learning (Cermelli et al., 2022). Additionally, we argue that tackling a set of SIS tasks incrementally can bring opportunities to learn to segment new categories more efficiently (making it possible to move from pixel-level to image-level labels) and more effectively. This can be enabled by taking into account the semantic relationship between old and new classes, as humans do. In this paper, we formalize and empirically validate a semantic prior loss that fosters such forward transfer. This leads to a continual learning procedure for weakly supervised SIS models that leverages the semantic similarity between class names and builds on top of the model's previous knowledge accordingly. If the model additionally needs to learn the 'sheep' class, for instance, our loss can leverage the model's effectiveness in dealing with other similar animals, such as cows, so that the new class does not need to be learned from scratch (see Fig. 1). We validate our approach by showing that it seamlessly extends state-of-the-art SIS methods. For all our experiments, we build on the WILSON approach of Cermelli et al. (2022), itself built on standard techniques for weakly supervised SIS (Borenstein & Ullman, 2004; Araslanov & Roth, 2020). We extend it with our Relation-aware Semantic Prior (RaSP) loss to encourage forward transfer across classes within the learning sequence. It is designed as a simple-to-adopt regularizer that measures the similarity across old and new categories and reuses knowledge accordingly.

To summarize, our contribution is threefold. First, we propose RaSP, a semantic prior loss that treats class labels not as mere indices, but as semantic entities whose relationships to each other matter.
Second, we broaden benchmarks previously used for weakly supervised class-incremental SIS to consider both longer sequences of tasks (prior art is limited to 2; we extend to up to 11) and a few-shot setting, both with image-level annotations only. Finally, we empirically validate that the steady improvement brought by RaSP also holds in an extended version of our approach that uses an episodic memory, filled with either past samples or web-crawled images for the old classes. We show that, in this context, the memory not only mitigates catastrophic forgetting but, most importantly, also fosters the learning of new categories.
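To make the intuition behind a semantic prior concrete, the sketch below combines the old model's score maps into a prior for a new class, weighting each old class by the similarity of its name embedding to the new class name. Everything here is a hypothetical stand-in (toy 4-d embeddings instead of pretrained word vectors, 2x2 score maps instead of real predictions); it illustrates the idea of class-name-driven forward transfer, not the exact RaSP formulation.

```python
import numpy as np

# Hypothetical toy class-name embeddings standing in for pretrained word
# embeddings; "sheep" is deliberately placed close to "cow".
embeddings = {
    "cow":   np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":   np.array([0.7, 0.2, 0.6, 0.1]),
    "sheep": np.array([0.85, 0.75, 0.15, 0.05]),
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_prior(old_scores, old_names, new_names):
    """Build a per-pixel prior for each new class as a softmax-weighted
    combination of the old classes' score maps, with weights given by
    class-name similarity."""
    priors = {}
    for new in new_names:
        sims = np.array([cosine(embeddings[new], embeddings[o]) for o in old_names])
        weights = np.exp(sims) / np.exp(sims).sum()  # softmax over similarities
        priors[new] = sum(w * old_scores[o] for w, o in zip(weights, old_names))
    return priors

# Hypothetical old-model score maps on a 2x2 image.
old_scores = {"cow": np.array([[0.9, 0.1], [0.8, 0.2]]),
              "dog": np.array([[0.1, 0.7], [0.1, 0.6]])}
prior = semantic_prior(old_scores, ["cow", "dog"], ["sheep"])
# The "sheep" prior is dominated by the "cow" map, since their name
# embeddings are the most similar.
```

In a full method, such a prior would act as an additional supervisory signal for the pixels of new classes, complementing the image-level labels.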

2. RELATED WORK

This work lies at the intersection of weakly supervised and class-incremental learning of SIS models. Due to the nature of our semantic prior loss, it also relates to text-guided computer vision.

Weakly supervised SIS. This term (Borenstein & Ullman, 2004) encompasses several tasks for which SIS models are trained using weaker annotations than the typical pixel-level labels, such as image captions, bounding boxes or scribbles. Methods assuming bounding box annotations for all relevant objects (either manually provided or produced by a pretrained detector) focus on segmenting instances within those bounding boxes (Dai et al., 2015; Ji & Veksler, 2021; Song et al., 2019; Kulharia et al., 2020). More related to our work are the methods that rely on image-level labels, exploiting class activation maps (CAM) as pixel-level supervision for SIS (Zhou et al., 2016b; Kolesnikov & Lampert, 2016; Roy & Todorovic, 2017; Huang et al., 2018; Ahn & Kwak, 2018; Araslanov & Roth, 2020).

Class-incremental SIS. Under the umbrella of continual learning (Parisi et al., 2019), class-incremental learning consists in exposing a model to sequences of tasks in which the goal is to learn new classes. While most class-incremental learning methods have focused on image classification (Masana et al., 2020), some recent works have started focusing on SIS (Cermelli et al., 2020; Michieli & Zanuttigh, 2021a; Douillard et al., 2021; Maracani et al., 2021; Cha et al., 2021). Yet, all aforementioned methods assume pixel-level annotations for all the new classes, which requires a huge, often prohibitively expensive amount of human work. Therefore, weakly supervised class-incremental SIS has emerged as a viable alternative in the pioneering work of Cermelli et al. (2022), which formalizes the task and proposes the WILSON method to tackle it.
WILSON builds on top of standard weakly supervised SIS techniques and explicitly tries to mitigate forgetting using feature-level knowledge distillation, akin to the pseudo-labeling approach of PLOP (Douillard et al., 2021).

Text-guided computer vision. Vision and language have a long history of benefiting from each other, and language, a modality that is inherently more semantic, has often been used as a source of supervision to guide computer vision tasks, such as learning visual representations (Quattoni et al., 2007; Gomez et al., 2017; Sariyildiz et al., 2020; Radford et al., 2021), object detection (Shi et al., 2017), zero-shot segmentation (Zhou et al., 2016b; Bucher et al., 2019; Xian et al., 2019; Li et al., 2020; Baek et al., 2021), language-driven segmentation (Zhao et al., 2017; Li et al., 2022; Ghiasi et al., 2022; Xu et al., 2022) or referring image segmentation (Hu et al., 2016; Liu et al., 2017;
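As background for the CAM-based weak supervision discussed above: a class activation map is obtained by weighting the network's last convolutional feature maps with the classifier weights of the target class and summing them, yielding a coarse localization map from image-level supervision alone. The sketch below uses hypothetical toy shapes and random values; it illustrates the generic CAM computation, not the exact formulation of any cited work.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM: weight the C feature maps (shape (C, H, W)) by the
    classifier weights of the target class, sum over channels, then keep
    positive evidence and normalize to [0, 1]."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)          # discard negative evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Hypothetical toy setup: 3 feature maps of size 4x4, a 2-class classifier.
rng = np.random.default_rng(0)
features = rng.random((3, 4, 4))      # last conv layer output
fc_weights = rng.random((2, 3))       # one weight vector per class
cam = class_activation_map(features, fc_weights, class_idx=0)
```

Maps of this kind are what CAM-based weakly supervised methods threshold or refine into pixel-level pseudo-labels.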

