LEARNING A UNIFIED LABEL SPACE

Abstract

How do we build a general and broad object detection system? We use all labels of all concepts ever annotated. These labels span many diverse datasets with potentially inconsistent semantic labels. In this paper, we show how to integrate these datasets and their semantic taxonomies in a completely automated fashion. Once integrated, we train an off-the-shelf object detector on the union of the datasets. This unified recognition system performs as well as dataset-specific models on each training domain, but generalizes much better to new unseen domains. Entries based on the presented methodology ranked first in the object detection and instance segmentation tracks of the ECCV 2020 Robust Vision Challenge.



Figure 1: Different datasets span diverse semantic and visual domains. We learn to unify the label spaces of multiple datasets and train a single object detector that generalizes across datasets.

1. INTRODUCTION

Computer vision aims to produce broad, general-purpose perception systems that work in the wild. Yet object detection is fragmented into datasets (Lin et al., 2014; Neuhold et al., 2017; Shao et al., 2019; Kuznetsova et al., 2020), and our models are locked into specific domains. This fragmentation brought rapid progress in object detection (Ren et al., 2015) and instance segmentation (He et al., 2017), but comes with a drawback: single datasets are limited and do not yield general-purpose recognition systems. Can we alleviate these limitations by unifying diverse detection datasets?

In this paper, we make training an object detector on the union of disparate datasets as straightforward as training on a single one. The core challenge lies in integrating different datasets into a common taxonomy and label space. A traditional approach is to create this taxonomy by hand (Lambert et al., 2020; Zhao et al., 2020), which is both time-consuming and error-prone. We present a fully automatic way to unify the output space of a multi-dataset detection system using visual data alone. We exploit the fact that object detectors for similar concepts from different datasets fire on similar novel objects. This allows us to define a cost for merging concepts across datasets and to optimize for a common taxonomy fully automatically. Our optimization jointly finds a unified taxonomy, a mapping from this taxonomy to each dataset, and a detector over the unified taxonomy, using a novel 0-1 integer programming formulation. An object detector trained on this unified taxonomy has a large, automatically constructed vocabulary of concepts drawn from all training datasets.

We evaluate our unified object detector at an unprecedented scale. We train a unified detector on four large and diverse datasets: COCO (Lin et al., 2014), Objects365 (Shao et al., 2019), OpenImages (Kuznetsova et al., 2020), and Mapillary (Neuhold et al., 2017).
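The intuition behind the merge cost can be illustrated with a toy sketch. The paper solves a novel 0-1 integer program jointly with detector training; the snippet below is only a simplified greedy stand-in that captures the core signal (detectors for the same concept fire on the same novel objects). All class names, score matrices, and the threshold are hypothetical.

```python
import numpy as np

def merge_cost(scores_a, scores_b):
    """Cost of declaring two dataset-specific classes the same concept.

    scores_a, scores_b: detection scores of the two classes on a shared
    pool of novel images. Detectors for the same concept fire on the
    same objects, so their score profiles should agree.
    """
    a = scores_a - scores_a.mean()
    b = scores_b - scores_b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
    return 1.0 - float(a @ b) / denom  # low cost <=> similar firing pattern

def greedy_unify(scores_1, scores_2, names_1, names_2, threshold=0.5):
    """Greedily pair classes across two datasets by ascending merge cost.

    Each class joins at most one pair (the 'direct mapping' constraint);
    classes left unmatched become their own unified labels.
    """
    candidates = sorted(
        (merge_cost(scores_1[i], scores_2[j]), i, j)
        for i in range(len(names_1)) for j in range(len(names_2))
    )
    used_1, used_2, unified = set(), set(), []
    for cost, i, j in candidates:
        if cost > threshold:
            break
        if i in used_1 or j in used_2:
            continue
        used_1.add(i)
        used_2.add(j)
        unified.append((names_1[i], names_2[j]))
    # Unmatched classes keep their original label in the unified space.
    unified += [(names_1[i], None) for i in range(len(names_1)) if i not in used_1]
    unified += [(None, names_2[j]) for j in range(len(names_2)) if j not in used_2]
    return unified
```

On toy score profiles where a "car" class and an "automobile" class fire on the same images, the two are merged, while a class with an unrelated firing pattern remains a singleton. The actual formulation additionally optimizes the taxonomy, the per-dataset mappings, and the detector jointly, and handles more than two datasets.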
Experiments show that our learned taxonomy outperforms the best expert-annotated label spaces, as well as language-based alternatives. For the first time, we show that a single detector performs as well as dataset-specific models on each individual dataset. Crucially, we show that models trained on the diverse training sets generalize zero-shot to new domains and outperform single-dataset models. Our models ranked first in the object detection and instance segmentation tracks of the ECCV 2020 Robust Vision Challenge across all evaluation datasets. Code and models will be released upon acceptance.

2. RELATED WORK

Training on multiple datasets. In recent years, training on multiple diverse datasets has emerged as an effective tool to improve model robustness for depth estimation (Ranftl et al., 2020) and stereo matching (Yang et al., 2019). In these domains, unifying the output space involves modeling different camera models and depth ambiguities. In contrast, for recognition, unification involves merging different semantic concepts. MSeg (Lambert et al., 2020) manually created a unified label taxonomy spanning 7 semantic segmentation datasets and used Amazon Mechanical Turk to resolve inconsistent annotations between datasets. Unlike MSeg, our solution requires no manual effort and unifies the label space directly from visual data in a fully automatic way. Wang et al. (2019) train a universal object detector on multiple datasets and gain robustness by joining diverse sources of supervision. However, they produce a dataset-specific prediction for each input image. When evaluated in-domain, they require knowledge of the test domain; when evaluated out-of-domain, they produce multiple outputs for a single concept. This limits the generalization ability of detection, as we show in experiments (Section 5.2).
Our approach, on the other hand, merges visual concepts at training time and yields a single consistent model that does not require knowledge of the test domain and can be deployed cleanly in new domains. Both Wang et al. (2019) and MSeg (Lambert et al., 2020) observe a performance drop in a single unified model. With our unified label space and a dedicated training framework, this is not the case: the unified model performs as well as single-dataset models on the training datasets. Zhao et al. (2020) train a universal detector on multiple datasets: COCO (Lin et al., 2014), Pascal VOC (Everingham et al., 2010), and SUN-RGBD (Song et al., 2015), with under 100 classes in total. They manually merge the taxonomies and then train with cross-dataset pseudo-labels generated by dataset-specific models. The pseudo-label idea is complementary to our work. Our unified label space learning removes the manual labor and works at a much larger scale: we unify COCO, Objects365, and OpenImages, with more complex label spaces and 900+ classes. YOLO9000 (Redmon & Farhadi, 2017) combines detection and classification datasets to expand the detection vocabulary. LVIS (Gupta et al., 2019) extends COCO annotations to more than 1,000 classes in a federated way. Our approach of fusing multiple readily annotated datasets is complementary and requires no manual effort to unify disparate object detection datasets.

Zero-shot classification and detection reason about novel object categories outside the training set (Fu et al., 2018; Bansal et al., 2018). This is often realized by representing a novel class with a semantic embedding (Norouzi et al., 2014) or auxiliary attribute annotations (Farhadi et al., 2009). In zero-shot detection, Bansal et al. (2018) proposed a statically assigned background model to avoid novel classes being detected as background. Rahman et al. (2019) included the novel class word embedding in test-time training to progressively generate novel class labels. Li et al. (2019) leveraged external text descriptions for novel objects.
Our approach is complementary: we aim to build a sufficiently large label space by merging diverse detection datasets during training, such that the trained detector transfers well across domains even without machinery such as word embeddings or attributes. Such machinery can be added, if desired, to further expand the model's vocabulary.

3. PRELIMINARIES

An object detector jointly predicts the locations b_k ∈ R^4 and classwise detection scores d_k ∈ R^{|L|} of all objects in a scene. The detection score describes the confidence that a bounding box belongs to an object with label l ∈ L, where L is the set of all classes. Figure 2a provides an overview. On a single dataset, the detector is trained to produce high scores only for the ground-truth class.

Consider multiple datasets, each with its own label space L_1, L_2, .... A detector now needs to learn a common label space L for all datasets and define a mapping from this common label space to the dataset-specific labels, L → L_i. In this work, we only consider direct mappings: each common label maps to at most one dataset-specific label per dataset, and each dataset-specific label maps to exactly one common label. In particular, we do not hierarchically relate concepts across datasets. When datasets annotate concepts at different granularities, we keep all granularities in our label space and expect the detector to predict all of them. Mathematically, the mapping from the joint output space to a dataset-specific one is a Boolean linear transformation of the output of the recognition system,

    d^(i) = T_i^T d,    where T_i ∈ {0,1}^{|L| × |L_i|}

is a 0-1 matrix with exactly one nonzero entry per column (each dataset-specific label comes from exactly one common label) and at most one nonzero entry per row (each common label maps to at most one label of L_i).
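The Boolean mapping is easy to state concretely. Below is a minimal sketch, with a hypothetical 4-class unified space, hypothetical label names, and made-up scores, of building T_i for one dataset and projecting unified detection scores onto that dataset's label space:

```python
import numpy as np

def mapping_matrix(unified, dataset_to_unified):
    """Build T_i in {0,1}^{|L| x |L_i|} for one dataset.

    dataset_to_unified: list of (dataset_label, unified_label) pairs.
    Each dataset label maps to exactly one unified label (one 1 per
    column), and each unified label is used by at most one dataset
    label (at most one 1 per row): a direct, non-hierarchical mapping.
    """
    row = {name: k for k, name in enumerate(unified)}
    T = np.zeros((len(unified), len(dataset_to_unified)), dtype=int)
    for col, (_, unified_label) in enumerate(dataset_to_unified):
        T[row[unified_label], col] = 1
    assert (T.sum(axis=0) == 1).all()  # every dataset label covered once
    assert (T.sum(axis=1) <= 1).all()  # no unified label used twice
    return T

# Hypothetical unified space and a smaller dataset mapped into it.
unified = ["car", "person", "traffic sign", "bicycle"]
dataset_map = [("car", "car"), ("person", "person"), ("bicycle", "bicycle")]
T = mapping_matrix(unified, dataset_map)

d = np.array([0.9, 0.2, 0.7, 0.1])  # unified detection scores for one box
d_i = T.T @ d                       # dataset-specific scores: [0.9, 0.2, 0.1]
```

Note how the score for "traffic sign", a concept this dataset does not annotate, is simply dropped when projecting into its label space; the unified detector still predicts it.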




