GAPS: FEW-SHOT INCREMENTAL SEMANTIC SEGMENTATION VIA GUIDED COPY-PASTE SYNTHESIS

Anonymous

Abstract

Few-shot incremental segmentation is the task of updating a segmentation model as novel classes are introduced online over time with a small number of training images. Although incremental segmentation methods exist in the literature, they tend to fall short in the few-shot regime and when given partially-annotated training images, where only the novel class is segmented. This paper proposes a data synthesizer, Guided copy-And-Paste Synthesis (GAPS), that improves the performance of few-shot incremental segmentation in a model-agnostic fashion. Despite the great success of copy-paste synthesis in conventional offline visual recognition, we demonstrate substantially degraded performance of its naïve extension in our online scenario, due to newly encountered challenges. To this end, GAPS (i) addresses the partial-annotation problem by leveraging copy-paste to generate fully-labeled training data, (ii) augments the few images of novel objects by introducing a guided sampling process, and (iii) mitigates catastrophic forgetting by employing a diverse memory-replay buffer. Compared to existing state-of-the-art methods, GAPS dramatically boosts the novel IoU of baseline methods on established few-shot incremental segmentation benchmarks by up to 80%. More notably, GAPS maintains good performance in even more impoverished annotation settings, where only single instances of novel objects are annotated.

1. INTRODUCTION

Incremental segmentation is an important capability for open-world AI systems. For example, consider a housekeeping robot that has been trained to segment common household objects, but once deployed in a user's home encounters a previously unseen type of furniture. For such practical applications, incremental segmentation would be capable of expanding the set of recognized classes to include the new object. There are a few desired properties of incremental segmentation algorithms operating under these scenarios. First, the algorithm should be equipped with few-shot learning capability, meaning it can benefit from as few as one image provided by a user rather than requiring hundreds of images annotated offline by professional annotators. Second, providing full segmentation annotation of an image is time-consuming. To avoid placing a substantial burden on untrained users, the algorithm needs to be trainable with partially-annotated images where only novel classes are annotated. A few attempts have been made by recent works (Cermelli et al., 2020; Cha et al., 2021; Douillard et al., 2021; Zhang et al., 2022; Yan et al., 2021) on non-few-shot incremental segmentation to investigate learning with partially-annotated images, a problem termed semantic background shift (Cermelli et al., 2020). Background shift describes a challenge unique to incremental semantic segmentation: classes that are not in the current learning step are assigned 'background' labels, which prohibits direct end-to-end training. Recent work uses either modified losses (Cermelli et al., 2020; Zhang et al., 2022) or pseudo-labeling (Cha et al., 2021; Douillard et al., 2021; Yan et al., 2021) as proxies to train on partially-annotated images.
However, although these proxy methods demonstrate good performance in non-few-shot settings, they rely on rich annotations and fall short when only a limited amount of data is presented to the model, due to a lack of data diversity. An even more restrictive setting occurs when users label only a single instance of the novel class, which can dramatically hurt the performance of proxy methods, because the training images contain non-annotated instances of the novel class (which are treated as negative pixels).

To address the aforementioned challenges, we propose GAPS (Guided copy-And-Paste Synthesis), which improves the training of incremental segmentation models by synthesizing fully-annotated images from partially-annotated examples. It is model-agnostic and can be inserted as a plug-and-play module into different incremental learning algorithms, e.g., standard fine-tuning or PIFS (Cermelli et al., 2021). Copy-paste generates diverse training data to boost performance under few-shot settings, and enables the model to learn from partially-annotated images with as few as one annotated novel instance out of many novel instances in an image (e.g., as illustrated in the lower left part of Fig. 1), which is a stricter setting than semantic background shift (Cermelli et al., 2020). To the best of our knowledge, we are the first to introduce copy-paste as a synthesis technique to create a diverse data source for few-shot incremental segmentation. Although copy-paste (Ghiasi et al., 2021) has been shown to be an effective data augmentation technique for offline visual recognition tasks, we identify new key technical challenges in adapting it to the few-shot incremental setting. First, how should the synthesizer pick representative samples from the base dataset to construct a diverse pool of fully-annotated base scenes? Second, given the constructed pool of fully-annotated images, how should it select the most suitable base images to paste onto? Third, after an informative image is selected, from what distribution should it sample current and previously learned novel objects to balance sample frequency and avoid over-sampling or under-sampling? GAPS differs from a naïve (e.g., uniform random sampling) copy-paste process by a guided strategy that considers the diversity of the memory-replay buffer, the imbalanced class frequencies between base and novel classes, and the contextual similarity of images.

In summary, our contributions are as follows:

1. We are the first to introduce copy-paste as a synthesis technique to address partially-labeled images for incremental segmentation.
2. To bridge the gap between copy-paste as an augmentation technique under the offline setting and as a synthesis technique under the online setting, we design a guided copy-paste process that improves the distribution of synthesized images by enforcing diversity of the memory-replay buffer, exploiting contextual information, and balancing class frequencies.
3. The proposed GAPS technique consistently boosts the performance of a variety of incremental learning algorithms, from simple fine-tuning to sophisticated state-of-the-art methods, under
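To make the basic mechanics concrete, the sketch below illustrates the core copy-paste step (pasting a masked novel instance onto a fully-labeled base scene, so the synthesized label map remains fully annotated) and a simple inverse-frequency sampling rule of the kind the third question above calls for. This is a minimal illustration under assumed conventions, not the paper's implementation: the names (`paste_object`, `balanced_class_probs`) are hypothetical, object placement, scaling, and blending are omitted, and the guided diversity and context terms are not modeled.

```python
import numpy as np

def paste_object(base_img, base_labels, novel_img, novel_mask, novel_class_id):
    """Paste one masked novel instance onto a fully-labeled base image.

    base_img:    (H, W, 3) uint8 base scene
    base_labels: (H, W)    int   full label map of the base scene
    novel_img:   (H, W, 3) uint8 image containing the novel object
    novel_mask:  (H, W)    bool  binary mask of a single annotated instance
    """
    out_img = base_img.copy()
    out_labels = base_labels.copy()
    out_img[novel_mask] = novel_img[novel_mask]  # copy object pixels over the scene
    out_labels[novel_mask] = novel_class_id      # label map stays fully annotated
    return out_img, out_labels

def balanced_class_probs(class_counts):
    """Inverse-frequency sampling weights over classes in the replay buffer,
    so rare novel classes are pasted more often than frequent base classes."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = 1.0 / np.maximum(counts, 1.0)
    return weights / weights.sum()
```

A synthesizer along these lines would draw a class from `balanced_class_probs`, pick one of its stored instances, and call `paste_object` on a selected base scene; the guided strategy additionally biases these choices by buffer diversity and contextual similarity.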



Figure 1: Our proposed method utilizes guided copy-paste augmentation to synthesize diverse training data, using as few as one novel instance for training. For example, the model encounters an image of many motorcycles, a class novel to the model. As a result, the model incorrectly assigns learned bicycle labels to these pixels and therefore needs to be updated. Our proposed method can adapt to the novel motorcycle class from an annotation of a single motorcycle, which can be annotated efficiently, whereas previous work (Cermelli et al., 2020; 2021) requires time-consuming annotation of all motorcycle instances or even the entire image. Best viewed in color.

