LEARNING A UNIFIED LABEL SPACE

Abstract

How do we build a general and broad object detection system? We use all labels of all concepts ever annotated. These labels span many diverse datasets with potentially inconsistent semantic labels. In this paper, we show how to integrate these datasets and their semantic taxonomies in a completely automated fashion. Once integrated, we train an off-the-shelf object detector on the union of the datasets. This unified recognition system performs as well as dataset-specific models on each training domain, but generalizes much better to new unseen domains. Entries based on the presented methodology ranked first in the object detection and instance segmentation tracks of the ECCV 2020 Robust Vision Challenge.



Figure 1: Different datasets span diverse semantic and visual domains. We learn to unify the label spaces of multiple datasets and train a single object detector that generalizes across datasets.

1. INTRODUCTION

Computer vision aims to produce broad, general-purpose perception systems that work in the wild. Yet object detection is fragmented into datasets (Lin et al., 2014; Neuhold et al., 2017; Shao et al., 2019; Kuznetsova et al., 2020), and our models are locked into specific domains. This fragmentation brought rapid progress in object detection (Ren et al., 2015) and instance segmentation (He et al., 2017), but comes with a drawback: single datasets are limited and do not yield general-purpose recognition systems. Can we alleviate these limitations by unifying diverse detection datasets?

In this paper, we make training an object detector on the union of disparate datasets as straightforward as training on a single one. The core challenge lies in integrating different datasets into a common taxonomy and label space. A traditional approach is to create this taxonomy by hand (Lambert et al., 2020; Zhao et al., 2020), which is both time-consuming and error-prone. We present a fully automatic way to unify the output space of a multi-dataset detection system using visual data only. We use the fact that object detectors for similar concepts from different datasets fire on similar novel objects. This allows us to define the cost of merging concepts across datasets, and to optimize for a common taxonomy fully automatically. Our optimization jointly finds a unified taxonomy, a mapping from this taxonomy to each dataset, and a detector over the unified taxonomy using a novel 0-1 integer programming formulation. An object detector trained on this unified taxonomy has a large, automatically constructed vocabulary of concepts from all training datasets.

We evaluate our unified object detector at an unprecedented scale. We train a unified detector on 4 large and diverse datasets: COCO (Lin et al., 2014), Objects365 (Shao et al., 2019), OpenImages (Kuznetsova et al., 2020), and Mapillary (Neuhold et al., 2017).
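To make the merging idea concrete, here is a minimal, hypothetical sketch of the core intuition: detectors trained on different datasets are run on a shared pool of novel images, two labels become merge candidates when their detector responses are similar, and a simple greedy matching stands in for the paper's 0-1 integer program. All label names and score vectors below are illustrative toy data, not the paper's actual datasets or formulation.

```python
# Toy sketch: merge labels across two datasets by the similarity of their
# detectors' responses on a shared set of novel images. The greedy matching
# here is a stand-in for the paper's integer-programming optimization.

def cosine(u, v):
    # Cosine similarity between two score vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def merge_labels(scores_a, scores_b, threshold=0.8):
    """Greedily pair labels from datasets A and B whose per-image detector
    scores agree; labels left unpaired stay dataset-specific."""
    candidates = sorted(
        ((cosine(va, vb), la, lb)
         for la, va in scores_a.items()
         for lb, vb in scores_b.items()),
        reverse=True)
    pairs, used_a, used_b = [], set(), set()
    for sim, la, lb in candidates:
        if sim < threshold or la in used_a or lb in used_b:
            continue
        pairs.append((la, lb, sim))
        used_a.add(la)
        used_b.add(lb)
    return pairs

# Hypothetical detector scores on 4 shared images (illustrative numbers).
scores_coco = {"person": [0.9, 0.1, 0.8, 0.0], "car": [0.1, 0.9, 0.0, 0.7]}
scores_oi = {"human": [0.85, 0.15, 0.75, 0.05], "vehicle": [0.0, 0.8, 0.1, 0.9]}

for la, lb, sim in merge_labels(scores_coco, scores_oi):
    print(f"merge {la} <-> {lb} (similarity {sim:.2f})")
```

In this toy example, "person"/"human" and "car"/"vehicle" are merged because their score vectors are nearly parallel, while cross pairs fall below the threshold. The paper's actual optimization replaces this greedy step with a joint 0-1 integer program over the taxonomy, the dataset mappings, and the detector.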
Experiments show that our learned taxonomy outperforms the best expert-annotated label spaces, as well as language-based alternatives. For the first time, we show that a single detector performs as well as dataset-specific models on each individual dataset. Crucially, we show that models trained on the diverse training sets generalize zero-shot to new domains, and outperform single-dataset models. Our models ranked first in the object detection and instance segmentation tracks of the ECCV 2020 Robust Vision Challenge across all evaluation datasets. Code and models will be released upon acceptance.

