LMSEG: LANGUAGE-GUIDED MULTI-DATASET SEGMENTATION

Abstract

Building a general, inclusive segmentation model that can recognize more categories across diverse scenarios is a meaningful and attractive goal. A straightforward approach is to combine the existing fragmented segmentation datasets and train a multi-dataset network. However, there are two major issues with multi-dataset segmentation: (i) inconsistent taxonomies demand manual reconciliation to construct a unified taxonomy; (ii) the inflexible one-hot common taxonomy causes time-consuming model retraining and defective supervision of unlabeled categories. In this paper, we investigate multi-dataset segmentation and propose a scalable Language-guided Multi-dataset Segmentation framework, dubbed LMSeg, which supports both semantic and panoptic segmentation. Specifically, instead of using inflexible one-hot labels, we introduce a pre-trained text encoder that maps category names into a text embedding space serving as a unified taxonomy. The model dynamically aligns segment queries with the category embeddings. Rather than relabeling each dataset under a unified taxonomy, a category-guided decoding module is designed to dynamically guide predictions to each dataset's taxonomy. Furthermore, we adopt a dataset-aware augmentation strategy that assigns each dataset a specific image augmentation pipeline suited to the properties of its images. Extensive experiments demonstrate that our method achieves significant improvements on four semantic and three panoptic segmentation datasets, and an ablation study verifies the effectiveness of each component.

1. INTRODUCTION

Image segmentation has been a longstanding challenge in computer vision and plays a pivotal role in a wide variety of applications, ranging from autonomous driving (Levinson et al., 2011; Maurer et al., 2016) to remote sensing image analysis (Ghassemian, 2016). Building a general and inclusive segmentation model is meaningful for real-world applications. However, due to the cost of data collection and annotation, only fragmented segmentation datasets for various scenarios are available, such as ADE20K (Zhou et al., 2017), Cityscapes (Cordts et al., 2016), COCO-Stuff (Caesar et al., 2018), etc. Meanwhile, most segmentation work (Long et al., 2015; Chen et al., 2018; Zheng et al., 2021) focuses on the single-dataset case and overlooks the generalization of deep neural networks: for each new data scenario, a new set of network weights typically has to be trained. Given the expense of collecting images and annotations for all scenarios, constructing a multi-dataset segmentation model from the existing fragmented datasets is an attractive way to support more scenarios.

The primary issue of multi-dataset learning is inconsistent taxonomy, including category coincidence, ID conflicts, naming differences, etc. For example, the category labeled "person" in the ADE20K dataset is labeled as "person" and "rider" in the Cityscapes dataset. As shown in Figure 1(a), Lambert et al. (2020) manually establish a unified taxonomy with one-hot labels, relabel each dataset, and then train a segmentation model for all involved datasets, which is time-consuming and error-prone. Moreover, the one-hot taxonomy is inflexible and unscalable: when datasets or categories are extended, the unified taxonomy must be reconstructed and the model retrained. A group of more recent works (Wang et al., 2022) utilizes a multi-head architecture that trains a weight-shared encoder-decoder module with multiple dataset-specific heads, as shown in Figure 1(b).
The multi-head approach is a simple extension of traditional single-dataset learning and is inconvenient at inference time: to choose the appropriate segmentation head, the dataset a test image comes from must be predefined or specified.

To cope with these challenges, we propose a language-guided multi-dataset segmentation (LMSeg) framework that supports both semantic and panoptic segmentation tasks (Figure 1(c)). On the one hand, in contrast to a manual one-hot taxonomy, we introduce a pre-trained text encoder to automatically map category identifiers to a unified representation, i.e., a text embedding space. The image encoder extracts pixel-level features, while the query decoder bridges the text and image encoders and associates the text embeddings with the segment queries. Figure 2 depicts the core of the text-driven taxonomy for segmentation. As we can see, the text embeddings of categories reflect the semantic relationships among classes, which cannot be expressed by one-hot labels. Thus, the text-driven taxonomy can be extended without any manual reconstruction. On the other hand, instead of relabeling each dataset with a unified taxonomy, we dynamically redirect the model's predictions to each dataset's taxonomy. To this end, we introduce a category-guided decoding (CGD) module that guides the model to predict only the labels of the specified taxonomy. In addition, the image properties of different datasets vary, e.g., in resolution, style, and aspect ratio, so applying an appropriate data augmentation strategy is necessary; we therefore design a dataset-aware augmentation (DAA) strategy to cope with this.

In a nutshell, our contributions are four-fold:

• We propose a novel approach for multi-dataset semantic and panoptic segmentation, using text-query alignment to address the issue of taxonomy inconsistency.
• To bridge the gap between cross-dataset predictions and per-dataset annotations, we design a category-guided decoding module to dynamically guide predictions to each dataset's taxonomy.
• A dataset-aware augmentation strategy is introduced to adapt the preprocessing pipeline to the properties of each dataset.
• The proposed method achieves significant improvements on four semantic and three panoptic segmentation datasets.
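The text-query alignment and taxonomy redirection described above can be illustrated with a minimal NumPy sketch. The random unit vectors below stand in for embeddings produced by a real pre-trained text encoder (e.g. CLIP); the category names, embedding dimension, and per-dataset taxonomies are illustrative assumptions, not the paper's actual configuration. The key idea shown is that every dataset's label space indexes into one shared text-embedding space, and classification logits are computed as query-text similarities restricted to the annotated dataset's taxonomy:

```python
import numpy as np

def l2norm(x):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in for a pre-trained text encoder: real embeddings would come from
# encoding the category names (hypothetical toy setup, dim=16).
rng = np.random.default_rng(0)
EMBED_DIM = 16
unified_categories = ["person", "rider", "car", "road", "building"]
text_embeddings = l2norm(rng.normal(size=(len(unified_categories), EMBED_DIM)))

# Per-dataset taxonomies simply index into the unified embedding space,
# so adding a dataset requires no one-hot relabeling or retraining of labels.
dataset_taxonomies = {
    "cityscapes": ["person", "rider", "car", "road", "building"],
    "ade20k":     ["person", "car", "building"],
}

def classify_queries(query_feats, dataset):
    """Align segment queries with category text embeddings, restricting
    logits to the taxonomy of the dataset the sample is annotated with."""
    idx = [unified_categories.index(c) for c in dataset_taxonomies[dataset]]
    logits = l2norm(query_feats) @ text_embeddings[idx].T  # cosine similarity
    preds = logits.argmax(axis=-1)
    return [dataset_taxonomies[dataset][p] for p in preds]

# Toy segment queries: reuse the text embeddings themselves as ideal queries.
queries = text_embeddings[[0, 2]]            # "person", "car"
print(classify_queries(queries, "ade20k"))   # predicted within ADE20K's taxonomy
```

In the real model the segment queries come from the query decoder rather than the text encoder, but the restriction of logits to each dataset's category subset is the same mechanism that lets one model train against heterogeneous annotations.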
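The dataset-aware augmentation strategy can likewise be sketched as a simple per-dataset dispatcher. The pipeline parameters below (crop sizes, scale ranges) are hypothetical placeholders, not the values used in the paper; the point is only that each training sample is routed to the preprocessing configuration of its source dataset:

```python
# Hypothetical dataset-aware augmentation (DAA) dispatcher: each dataset gets
# its own preprocessing configuration suited to its image statistics.
AUG_PIPELINES = {
    "cityscapes": {"crop": (512, 1024), "scale_range": (0.5, 2.0)},
    "ade20k":     {"crop": (512, 512),  "scale_range": (0.5, 2.0)},
    "coco_stuff": {"crop": (640, 640),  "scale_range": (0.1, 2.0)},
}

def build_augmentation(dataset_name):
    """Return an augmentation callable bound to one dataset's config."""
    cfg = AUG_PIPELINES[dataset_name]

    def augment(image_hw):
        # Placeholder: a real pipeline would random-scale within
        # cfg["scale_range"] and then crop to cfg["crop"]; here we only
        # report which target crop the sample's dataset receives.
        return {"dataset": dataset_name, "crop": cfg["crop"]}

    return augment

aug = build_augmentation("cityscapes")
print(aug((1024, 2048))["crop"])  # crop size assigned to Cityscapes images
```

Binding the pipeline at dataset level (rather than using one global pipeline) is what lets high-resolution street scenes and lower-resolution indoor images each receive suitable scaling and cropping during joint training.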

2. RELATED WORK

2.1. SEMANTIC SEGMENTATION

As a dense prediction task, semantic segmentation plays a key role in high-level scene understanding. Since the pioneering work of fully convolutional networks (FCNs) (Long et al., 2015), pixel-



Figure 1: Comparison of different multi-dataset segmentation approaches.

