LMSEG: LANGUAGE-GUIDED MULTI-DATASET SEGMENTATION

Abstract

Building a general and inclusive segmentation model that can recognize more categories across diverse scenarios is a meaningful and attractive goal. A straightforward approach is to combine existing fragmented segmentation datasets and train a multi-dataset network. However, multi-dataset segmentation faces two major issues: (i) the inconsistent taxonomy demands manual reconciliation to construct a unified taxonomy; (ii) the inflexible one-hot common taxonomy causes time-consuming model retraining and defective supervision for unlabeled categories. In this paper, we investigate multi-dataset segmentation and propose a scalable Language-guided Multi-dataset Segmentation framework, dubbed LMSeg, which supports both semantic and panoptic segmentation. Specifically, instead of using inflexible one-hot labels, we introduce a pre-trained text encoder that maps category names into a text embedding space, which serves as a unified taxonomy. The model dynamically aligns segment queries with these category embeddings. Rather than relabeling each dataset with the unified taxonomy, a category-guided decoding module dynamically guides predictions toward each dataset's taxonomy. Furthermore, we adopt a dataset-aware augmentation strategy that assigns each dataset a specific image augmentation pipeline suited to the properties of its images. Extensive experiments demonstrate that our method achieves significant improvements on four semantic and three panoptic segmentation datasets, and ablation studies validate the effectiveness of each component.

1. INTRODUCTION

Image segmentation has been a longstanding challenge in computer vision and plays a pivotal role in a wide variety of applications, ranging from autonomous driving (Levinson et al., 2011; Maurer et al., 2016) to remote sensing image analysis (Ghassemian, 2016). Building a general and inclusive segmentation model is valuable for real-world applications. However, due to the cost of data collection and annotation, only fragmented segmentation datasets of various scenarios are available, such as ADE20K (Zhou et al., 2017), Cityscapes (Cordts et al., 2016), and COCO-Stuff (Caesar et al., 2018). Meanwhile, most segmentation works (Long et al., 2015; Chen et al., 2018; Zheng et al., 2021) focus on the single-dataset case and overlook the generalization ability of deep neural networks; in general, a new set of network weights must be trained for each data scenario. Since collecting images and annotations for all scenarios is prohibitively expensive, constructing a multi-dataset segmentation model from the existing fragmented datasets is an attractive way to support more scenarios.

The primary issue of multi-dataset learning is the inconsistent taxonomy, including category coincidence, ID conflict, and naming differences. For example, the category labeled "person" in the ADE20K dataset corresponds to both "person" and "rider" in the Cityscapes dataset. As shown in Figure 1(a), Lambert et al. (2020) manually establish a unified taxonomy with one-hot labels, relabel each dataset, and then train a segmentation model on all involved datasets, which is time-consuming and error-prone.
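The language-guided alternative described in the abstract sidesteps this manual relabeling: category names from each dataset are embedded by a pre-trained text encoder into a shared space, and segment queries are classified by similarity against each dataset's own category embeddings. A minimal PyTorch sketch of this idea follows; the function name, temperature value, and random embeddings are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def classify_queries(query_embeds, category_embeds, temperature=0.07):
    """Score segment queries against one dataset's category embeddings.

    query_embeds:    (num_queries, dim) embeddings from the segmentation decoder.
    category_embeds: (num_categories, dim) text embeddings of this dataset's
                     category names, produced by a frozen pre-trained text encoder.
    Returns:         (num_queries, num_categories) classification logits.
    """
    q = F.normalize(query_embeds, dim=-1)      # unit-normalize for cosine similarity
    c = F.normalize(category_embeds, dim=-1)
    return q @ c.t() / temperature

# Toy example: 3 segment queries scored against two datasets whose taxonomies
# differ, but which live in the same embedding space -- no relabeling into a
# unified one-hot label space is required. Embeddings are random placeholders.
dim = 8
queries = torch.randn(3, dim)
cityscapes_cats = torch.randn(19, dim)    # e.g. "person", "rider", ...
ade20k_cats = torch.randn(150, dim)       # e.g. "person", "building", ...

logits_a = classify_queries(queries, cityscapes_cats)  # shape (3, 19)
logits_b = classify_queries(queries, ade20k_cats)      # shape (3, 150)
```

Because each dataset supplies only its own category-name embeddings at inference time, adding a new dataset amounts to encoding its label names rather than rebuilding a merged label space.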

