NOISE OR SIGNAL: THE ROLE OF IMAGE BACKGROUNDS IN OBJECT RECOGNITION

Abstract

We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal in ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds (up to 88% of the time with adversarially chosen backgrounds), and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how those correlations determine models' out-of-distribution performance.

1. INTRODUCTION

Object recognition models are typically trained to minimize loss on a given dataset, and evaluated by the accuracy they attain on the corresponding test set. In this paradigm, model performance can be improved by incorporating any generalizing correlation between images and their labels into decision-making. However, actual model reliability and robustness depend on the specific set of correlations that is used, and on how those correlations are combined. Indeed, outside of the training distribution, model predictions can deviate wildly from human expectations, either due to relying on correlations that humans do not perceive (Jetley et al., 2018; Ilyas et al., 2019; Jacobsen et al., 2019), or due to overusing correlations, such as texture (Geirhos et al., 2019; Baker et al., 2018) and color (Yip & Sinha, 2002), that humans do use (but to a lesser degree). Characterizing the correlations that models depend on thus has important implications for understanding model behavior in general.

Image backgrounds are a natural source of correlation between images and their labels in object recognition. Indeed, prior work has shown that models may use backgrounds in classification (Zhang et al., 2007; Ribeiro et al., 2016; Zhu et al., 2017; Rosenfeld et al., 2018; Zech et al., 2018; Barbu et al., 2019; Shetty et al., 2019; Sagawa et al., 2020; Geirhos et al., 2020), and suggests that even human vision makes use of image context for scene and object recognition (Torralba, 2003). In this work, we aim to obtain a deeper and more holistic understanding of how current state-of-the-art image classifiers utilize image backgrounds. To this end, in contrast to most of the prior work (which tends to study relatively small and often newly-curated image datasets¹), our focus is on ImageNet (Russakovsky et al., 2015), one of the largest and most widely used datasets, with state-of-the-art training methods, architectures, and pre-trained models tuned to work well for it.
Zhu et al. (2017) analyze ImageNet classification (focusing on the older AlexNet model) and find that AlexNet achieves small but non-trivial test accuracy on a dataset consisting of only backgrounds (where foreground objects are replaced by black rectangles). While sufficient for establishing that backgrounds can be used for classification, we aim to go beyond those initial explorations to get a more fine-grained understanding of the relative importance of backgrounds and foregrounds for newer, state-of-the-art models, and to provide a versatile toolkit for others to use. Specifically, we investigate the extent to which models rely on backgrounds, the implications of this reliance, and how models' use of backgrounds has evolved over time. Concretely:

• We create a suite of datasets that help disentangle (and control for different aspects of) the impact of foreground and background signals on classification. The code and datasets are publicly available for others to use in this repository: https://github.com/MadryLab/backgrounds_challenge.

• Using the aforementioned toolkit, we characterize models' reliance on image backgrounds. We find that image backgrounds alone suffice for fairly successful classification and that changing background signals decreases average-case performance. In fact, we further show that by choosing backgrounds in an adversarial manner, we can make standard models misclassify 88% of images as the background class.

• We demonstrate that standard models not only use but require backgrounds for correctly classifying large portions of test sets (35% on our benchmark).

• We study the impact of backgrounds on classification for a variety of classifiers, and find that models with higher ImageNet test accuracy tend to simultaneously have higher accuracy on image backgrounds alone and greater robustness to changes in image background.
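The adversarial-background evaluation above can be sketched as follows. This is a minimal illustration of the idea only: compositing a fixed foreground onto candidate backgrounds and searching for one that flips the model's prediction. The helper names and the stub model are our own, not the paper's actual pipeline.

```python
import numpy as np

def paste_foreground(background, foreground, mask):
    """Composite a foreground onto a background.
    All images are H x W x C arrays; mask is H x W, 1 on foreground pixels."""
    m = mask[..., None].astype(background.dtype)
    return foreground * m + background * (1 - m)

def adversarial_background(model, foreground, mask, candidate_backgrounds, true_label):
    """Search the candidate backgrounds for one that makes `model` predict
    something other than `true_label`; return the composite and a success flag."""
    for bg in candidate_backgrounds:
        composite = paste_foreground(bg, foreground, mask)
        if model(composite) != true_label:
            return composite, True
    # No candidate fooled the model; return the first composite unchanged.
    return paste_foreground(candidate_backgrounds[0], foreground, mask), False
```

In the actual study, `model` would be a pre-trained ImageNet classifier and the candidates would be backgrounds extracted from other images; here any callable mapping an image to a label works.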

2. METHODOLOGY

To properly gauge image backgrounds' role in image classification, we construct a synthetic dataset for disentangling background from foreground signal: ImageNet-9.

Base dataset: ImageNet-9. We organize a subset of ImageNet into a new dataset with nine coarse-grained classes and call it ImageNet-9 (IN-9)². To create it, we group together ImageNet classes sharing an ancestor in the WordNet (Miller, 1995) hierarchy. We use coarse-grained classes because there are not enough images with annotated bounding boxes (which we need to disentangle backgrounds and foregrounds) to use the standard labels. The resulting IN-9 dataset is class-balanced and has 45,405 training images and 4,050 testing images. While we can (and do) apply our methods to the full ImageNet dataset as well, we choose to focus on this coarse-grained version of ImageNet because of its higher-fidelity images. We describe the dataset creation process in detail and discuss the advantages of focusing on IN-9 in Appendix A.
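The grouping step described above can be sketched as follows. This is a toy illustration under our own assumptions: we represent the WordNet hierarchy as a simple child-to-parent dictionary and walk each fine-grained label upward until a coarse class is reached, rather than using the real WordNet synset graph.

```python
def coarsen_labels(fine_labels, parent, coarse_classes):
    """Map each fine-grained label to a coarse class by walking up the
    hierarchy; labels with no coarse-class ancestor are dropped."""
    grouped = {}
    for label in fine_labels:
        node = label
        while node is not None and node not in coarse_classes:
            node = parent.get(node)  # step to the parent, or None at the root
        if node is not None:
            grouped.setdefault(node, []).append(label)
    return grouped
```

For the real dataset, the `parent` relation would come from WordNet hypernyms (e.g. via NLTK's WordNet interface) and the coarse classes would be the nine IN-9 ancestors.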



¹ We discuss these works in greater detail in Section 5, Related Works.
² These classes are dog, bird, vehicle, reptile, carnivore, insect, instrument, primate, and fish.



Figure 1: Variations of the synthetic dataset ImageNet-9, as described in Table 1. We label each image with its pre-trained ResNet-50 classification: green if it matches the original label, red if not. The model correctly classifies the image as "insect" when given the original image, only the background, and two cases where the original foreground is present but the background changes. Note, however, that the model fails in two other cases where the original foreground is present but the background changes (as in MIXED-NEXT or ONLY-FG).

