NOISE OR SIGNAL: THE ROLE OF IMAGE BACKGROUNDS IN OBJECT RECOGNITION

Abstract

We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds (up to 88% of the time with adversarially chosen backgrounds), and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use and how those correlations shape models' out-of-distribution performance.

1. INTRODUCTION

Object recognition models are typically trained to minimize loss on a given dataset, and evaluated by the accuracy they attain on the corresponding test set. In this paradigm, model performance can be improved by incorporating any generalizing correlation between images and their labels into decision-making. However, actual model reliability and robustness depend on the specific set of correlations the model uses, and on how those correlations are combined. Indeed, outside of the training distribution, model predictions can deviate wildly from human expectations, either due to relying on correlations that humans do not perceive (Jetley et al., 2018; Ilyas et al., 2019; Jacobsen et al., 2019), or due to overusing correlations, such as texture (Geirhos et al., 2019; Baker et al., 2018) and color (Yip & Sinha, 2002), that humans do use (but to a lesser degree). Characterizing the correlations that models depend on thus has important implications for understanding model behavior in general.

Image backgrounds are a natural source of correlation between images and their labels in object recognition. Indeed, prior work has shown that models may use backgrounds in classification (Zhang et al., 2007; Ribeiro et al., 2016; Zhu et al., 2017; Rosenfeld et al., 2018; Zech et al., 2018; Barbu et al., 2019; Shetty et al., 2019; Sagawa et al., 2020; Geirhos et al., 2020), and suggests that even human vision makes use of image context for scene and object recognition (Torralba, 2003).

In this work, we aim to obtain a deeper and more holistic understanding of how current state-of-the-art image classifiers utilize image backgrounds. To this end, in contrast to most prior work (which tends to study relatively small and often newly-curated image datasets¹), our focus is on ImageNet (Russakovsky et al., 2015), one of the largest and most widely used datasets, with state-of-the-art training methods, architectures, and pre-trained models tuned to work well for it. Zhu et al. (2017) analyze ImageNet classification (focusing on the older AlexNet model) and find that AlexNet achieves small but non-trivial test accuracy on a dataset consisting of only backgrounds (where foreground objects are replaced by black rectangles). While sufficient for establishing that backgrounds can be used for classification, we aim to go beyond those initial explorations to get a more fine-grained understanding of the relative importance of backgrounds and foregrounds for newer, state-of-the-art models, and to provide a versatile toolkit for others to use.

Specifically, we investigate the extent to which models rely on backgrounds, the implications of this reliance, and how models' use of backgrounds has evolved over time. Concretely:

• We create a suite of datasets that help disentangle (and control for different aspects of) the impact of foreground and background signals on classification (a minimal illustration is sketched below). The code and datasets
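The datasets themselves are described in the following sections; as a rough illustration of the kind of manipulation involved, the sketch below constructs a background-only image by blacking out an annotated foreground bounding box, in the spirit of Zhu et al. (2017). The function name, file name, and box coordinates are hypothetical, and the paper's actual datasets are built with more careful foreground/background separation than a single rectangle.

```python
import numpy as np
from PIL import Image

def black_out_foreground(image_path, bbox):
    """Replace the annotated foreground region with a black rectangle,
    leaving only background signal (as in Zhu et al., 2017).

    `bbox` is assumed to be (xmin, ymin, xmax, ymax) in pixel coordinates,
    e.g. taken from an ImageNet bounding-box annotation.
    """
    img = np.array(Image.open(image_path).convert("RGB"))
    xmin, ymin, xmax, ymax = bbox
    img[ymin:ymax, xmin:xmax, :] = 0  # zero out the foreground pixels
    return Image.fromarray(img)

# Hypothetical usage: file name and bounding box are placeholders.
# bg_only = black_out_foreground("n01440764_18.JPEG", bbox=(48, 35, 410, 290))
# bg_only.save("n01440764_18_background_only.JPEG")
```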



¹ We discuss these works in greater detail in Section 5, Related Work.

