GOOD: EXPLORING GEOMETRIC CUES FOR DETECTING OBJECTS IN AN OPEN WORLD

ABSTRACT

We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting to the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to exploiting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%. The code has been made available at https://github.com/autonomousvision/good.

1. INTRODUCTION

The standard object detection task is to detect objects from a predefined class list. However, when a model is deployed in the real world, it rarely encounters only objects from its predefined taxonomy. In the open-world setup, object detectors are required to detect all the objects in a scene even though they have only been trained on objects from a limited number of classes.

Estimating geometric cues such as depth and normals from a single RGB image has long been an active research area. Such mid-level representations possess built-in invariance to many appearance changes (e.g., brightness, color) and are more class-agnostic than RGB signals, see Figure 2. In other words, there is less discrepancy between known and unknown objects in terms of geometric cues. In recent years, thanks to stronger architectures and larger datasets (Ranftl et al., 2021b; 2022; Eftekhar et al., 2021), monocular estimators for mid-level representations have advanced significantly in prediction quality and generalization to novel scenes. When used off-the-shelf as pre-trained models on new datasets, they compute high-quality geometric cues efficiently. It is therefore natural to ask whether these models can provide additional knowledge that helps current RGB-based open-world object detectors overcome this generalization problem. In this paper, we propose a pseudo-labeling method for incorporating geometric cues into open-world object detector training. We first train an object proposal network on the predicted geometric cues to pseudo-label unannotated novel objects in the training set.
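To make the pseudo-labeling step concrete, the sketch below keeps the highest-scoring geometry-based proposals that do not overlap any annotated base-class box. The function names, the top-k budget, and the IoU threshold are illustrative assumptions for exposition, not the paper's exact procedure.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_pseudo_boxes(proposals, scores, annotated, top_k=3, iou_thresh=0.3):
    """Keep up to top_k highest-scoring proposals whose IoU with every
    annotated base-class box stays below iou_thresh (hypothetical helper)."""
    order = np.argsort(scores)[::-1]  # highest score first
    kept = []
    for i in order:
        if len(kept) == top_k:
            break
        if all(box_iou(proposals[i], g) < iou_thresh for g in annotated):
            kept.append(list(proposals[i]))
    return kept
```

The surviving boxes would then be added as class-agnostic pseudo-annotations when training the final detector.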



Figure 1: Comparison of GOOD with different baselines. Images in the first column are from validation sets of ADE20K (Zhou et al., 2019). From the second to fourth columns we show the detection results of three open-world object detection methods: OLN (Kim et al., 2021), GGN (Wang et al., 2022), and our Geometry-guided Open-world Object Detector (GOOD). The shown detection results are true-positive proposals from the top 100 proposals of each method. The numbers of true-positive proposals or ground-truth objects are denoted in parentheses. All models are trained on the RGB images from the PASCAL-VOC classes of the COCO dataset (Lin et al., 2014), which do not include houses, trees, or kitchen furniture. Both OLN and GGN fail to detect many objects not seen during training. GOOD generalizes better to unseen categories by exploiting the geometric cues.

Figure 2: Geometry cues are complementary to appearance cues for object localization. The depth and normal cues of the RGB image are extracted using off-the-shelf general-purpose monocular predictors. Left: Geometric cues abstract away the appearance details and focus on more holistic information such as object shapes, relative spatial locations (depth), and directional changes (normals). Right: By incorporating geometric cues, GOOD generalizes better than the RGB-based model OLN (Kim et al., 2021), i.e., it exhibits much smaller AR gaps between the base and novel classes.
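To illustrate how the two geometric cues relate, the minimal sketch below derives a normal-like map from a depth map via finite differences. This is only a toy stand-in for the pretrained monocular estimators used in the paper, and the function name is our own.

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate per-pixel surface normals from a depth map using
    finite differences; returns an (H, W, 3) unit-vector map."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # A surface normal is perpendicular to the local depth gradient.
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth, dtype=np.float64)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n
```

On a flat, fronto-parallel surface the gradient vanishes and every normal points toward the camera; on a tilted plane the normals tilt accordingly, which is exactly the "directional change" signal the caption refers to.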

Recent open-world detection methods (Kim et al., 2021; Wang et al., 2022) avoid suppressing the unannotated objects in the background, which has led to significant performance improvements. However, these methods still suffer from overfitting to the training classes. Trained only on RGB images, they rely mainly on appearance cues to detect objects of new categories and have great difficulty generalizing to novel-looking objects. Moreover, training on RGB images is known to be prone to short-cut learning (Geirhos et al., 2019; 2020; Sauer & Geiger, 2021): nothing constrains the model from overfitting to the textures or discriminative parts of the known classes during training. In this work, we propose to tackle this challenge by incorporating geometric cues extracted from the RGB images by general-purpose monocular estimators. We show that such cues significantly improve detection recall for novel object categories on challenging benchmarks.
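The recall metric used throughout (AR@k) can be sketched as follows. This simplified version averages recall over IoU thresholds from 0.5 to 0.95 for the top-k proposals of a single image; it is an illustration of the metric, not the exact COCO evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_recall(proposals, gt_boxes, k=100,
                   thresholds=np.arange(0.5, 1.0, 0.05)):
    """AR@k for one image: recall of the top-k proposals,
    averaged over IoU thresholds 0.5, 0.55, ..., 0.95."""
    recalls = []
    for t in thresholds:
        matched = sum(
            any(box_iou(p, g) >= t for p in proposals[:k])
            for g in gt_boxes)
        recalls.append(matched / max(len(gt_boxes), 1))
    return float(np.mean(recalls))
```

A class-agnostic detector that localizes every ground-truth object with tight boxes scores 1.0; each novel object it misses entirely lowers AR@k by 1/|gt| regardless of threshold.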

