TOWARDS A MORE RIGOROUS SCIENCE OF BLINDSPOT DISCOVERY IN IMAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

A growing body of work studies Blindspot Discovery Methods (BDMs): methods for finding semantically meaningful subsets of the data where an image classifier performs significantly worse, without making strong assumptions. Motivated by observed gaps in prior work, we introduce a new framework for evaluating BDMs, SpotCheck, that uses synthetic image datasets to train models with known blindspots, and a new BDM, PlaneSpot, that uses a 2D image representation. We use SpotCheck to run controlled experiments that identify factors that influence BDM performance (e.g., the number of blindspots in a model) and show that PlaneSpot outperforms existing BDMs. Importantly, we validate these findings using real data. Overall, we hope that the methodology and analyses presented in this work will serve as a guide for future work on blindspot discovery.

1. INTRODUCTION

A growing body of work has found that models with high test performance can still make systemic errors, which occur when the model performs significantly worse on a semantically meaningful subset of the data (Buolamwini & Gebru, 2018; Chung et al., 2019; Oakden-Rayner et al., 2020; Singla et al., 2021; Ribeiro & Lundberg, 2022). For example, past works have demonstrated that models trained to diagnose skin cancer from dermoscopic images sometimes rely on spurious artifacts (e.g., surgical skin markers that some dermatologists use to mark lesions); consequently, they have different performance on images with or without those spurious artifacts (Winkler et al., 2019; Mahmood et al., 2021). More broadly, finding systemic errors can help us detect algorithmic bias (Buolamwini & Gebru, 2018) or sensitivity to distribution shifts (Sagawa et al., 2020; Singh et al., 2020).

In this work, we focus on what we call the blindspot discovery problem: the problem of finding an image classification model's systemic errors¹ without making many of the assumptions considered in related works (e.g., we do not assume access to metadata to define semantically meaningful subsets of the data, tools to produce counterfactual images, a specific model structure or training process, or a human in the loop). We call methods for addressing this problem Blindspot Discovery Methods (BDMs) (e.g., Kim et al., 2019; Sohoni et al., 2020; Singla et al., 2021; d'Eon et al., 2021; Eyuboglu et al., 2022).

We note that blindspot discovery is an emerging research area and that there has been more emphasis on developing BDMs than on formalizing the problem itself. Consequently, we propose a problem formalization, summarize different approaches for evaluating BDMs, and summarize several high-level design choices made by BDMs. In doing so, we observe the following two gaps.
First, existing evaluations are based on an incomplete knowledge of the model's blindspots, which limits the types of measurements and claims they can make. Second, dimensionality reduction is a relatively underexplored aspect of BDM design. Motivated by these gaps in prior work, we propose a new evaluation framework, SpotCheck, and a new BDM, PlaneSpot. SpotCheck is a synthetic evaluation framework for BDMs that gives us complete knowledge of the model's blindspots and allows us to identify factors that influence BDM performance. Additionally, we refine the metrics used by past evaluations. PlaneSpot is a simple BDM that finds blindspots using a 2D image representation. We use SpotCheck to run controlled experiments that identify factors that influence BDM performance (e.g., the number of blindspots in a model) and show that PlaneSpot outperforms existing BDMs. We run additional semi-controlled experiments using the COCO dataset (Lin et al., 2014) and find that the trends discovered using SpotCheck generalize to real image data. Overall, we hope that the methodology and analyses presented in this work will help facilitate a more rigorous science of blindspot discovery.
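To give a rough sense of what a 2D-representation BDM of the kind PlaneSpot instantiates might look like, the sketch below reduces a model's image representation to two dimensions, clusters it jointly with an error indicator, and ranks clusters as hypothesized blindspots. The specific embedding (PCA), clustering model, error weight, and importance score are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def planespot_style_bdm(features, errors, n_clusters=10, error_weight=0.025):
    """Sketch of a 2D-representation BDM (illustrative, not the paper's method).

    features: (n, d) array of the model's image representations
              (e.g., penultimate-layer activations).
    errors:   (n,) array with 1 where the model misclassified the image.
    Returns a list of index arrays (hypothesized blindspots), ordered by
    a simple importance score.
    """
    # 1. Reduce the representation to 2 dimensions (PCA here for
    #    simplicity; any 2D embedding could be substituted).
    coords = PCA(n_components=2).fit_transform(features)

    # 2. Append a scaled error indicator so clusters separate both by
    #    appearance (the 2D coordinates) and by model performance.
    z = np.column_stack([coords, error_weight * errors])

    # 3. Cluster, then rank clusters by (error rate * size) as a
    #    simple stand-in for an importance score.
    labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(z)
    clusters = [np.where(labels == k)[0] for k in range(n_clusters)]
    clusters.sort(key=lambda idx: errors[idx].mean() * len(idx), reverse=True)
    return clusters
```

On data where the misclassified images share a distinctive representation, the top-ranked cluster should concentrate those errors, which is the behavior a BDM evaluation would then score.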

2. BACKGROUND

In this section, we formalize the problem of blindspot discovery for image classification. We then discuss general approaches for evaluating the Blindspot Discovery Methods (BDMs) designed to address this problem, as well as high-level design choices made by BDMs.

Problem Definition. The broad goal of finding systemic errors has been studied across a range of problem statements and method assumptions. Some common assumptions are:

• Access to metadata that helps define coherent subsets of the data (e.g., Kim et al., 2018; Buolamwini & Gebru, 2018; Singh et al., 2020).
• The ability to produce counterfactual images (e.g., Shetty et al., 2019; Singla et al., 2020; Xiao et al., 2021; Leclerc et al., 2021; Bharadhwaj et al., 2021; Plumb et al., 2022).
• A specific structure for the model we are analyzing (e.g., Alvarez-Melis & Jaakkola, 2018; Koh et al., 2020) or for the training process used to learn the model's parameters (e.g., Wong et al., 2021).
• A human in the loop, either through an interactive interface (e.g., Cabrera et al., 2019; Balayn et al., 2022) or by inspecting explanations (e.g., Yeh et al., 2020; Adebayo et al., 2022).

While appropriate at times, these assumptions all restrict the applicability of their respective methods. For example, consider the assumption of access to metadata that helps define coherent subsets of the data. To start, such metadata is much less common in applied settings than it is for common ML benchmarks. Further, the efficacy of methods that rely on this metadata is limited by its quantity and relevance; in general, efficiently collecting large quantities of relevant metadata is challenging because it requires anticipating the model's systemic errors. Consequently, we define the problem of blindspot discovery as the problem of finding an image classification model's systemic errors without making any of these assumptions.
More formally, suppose that we have an image classifier, f, and a dataset of labeled images, D = {x_i}_{i=1}^n. Then, a blindspot is a coherent (i.e., semantically meaningful) set of images, Ψ ⊆ D, where f performs significantly worse (i.e., p(f, Ψ) ≪ p(f, D \ Ψ) for some performance metric, p, such as recall). We denote the set of f's true blindspots as Ψ = {Ψ_m}_{m=1}^M. Next, we define the problem of blindspot discovery as the problem of finding Ψ using only f and D. Then, a BDM is a method that takes as input f and D and outputs an ordered (by some definition of importance) list of hypothesized blindspots, Ψ̂ = [Ψ̂_k]_{k=1}^K. Note that the Ψ_m and Ψ̂_k are sets of images.

Approaches to BDM evaluation. We observe that existing approaches to quantitatively evaluate BDMs fall into two categories.

The first category of evaluations simply measures the error rate or size of the Ψ̂_k (Singla et al., 2021; d'Eon et al., 2021). However, these evaluations have two problems. First, none of the properties they measure capture whether Ψ̂_k is coherent (e.g., a random sample of misclassified images has a high error rate but may not match a single semantically meaningful description). Second, f's performance on Ψ̂_k may not be representative of f's performance on similar images because BDMs are optimized to return high-error images (e.g., suppose that f has 90% accuracy on images of "squares and blue circles"; then, by returning the 10% of such images that are misclassified, a BDM could mislead us into believing that f has 0% accuracy on this type of image).

The second category of evaluations compares Ψ̂ to a subset of Ψ whose members have either been artificially induced or previously found (Sohoni et al., 2020; Eyuboglu et al., 2022). While these evaluations address the issues with those from the first category, they require knowledge of Ψ, which is usually incomplete (i.e., we usually only know a subset of Ψ).
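The performance condition in this definition, p(f, Ψ) ≪ p(f, D \ Ψ), can be made concrete with a minimal check. In the sketch below, accuracy stands in for the metric p and a fixed gap threshold stands in for the informal "significantly worse"; both are our own illustrative choices, and coherence of Ψ still has to be judged separately.

```python
import numpy as np


def is_blindspot_candidate(y_true, y_pred, mask, gap=0.2):
    """Check the performance condition p(f, Psi) << p(f, D \\ Psi).

    y_true, y_pred: (n,) label arrays for the dataset D.
    mask:           (n,) boolean array selecting the candidate set Psi.
    gap:            illustrative threshold standing in for "significantly worse".
    Only the performance gap is tested here; whether Psi is coherent
    (semantically meaningful) is a separate question.
    """
    acc_in = (y_true[mask] == y_pred[mask]).mean()    # p(f, Psi)
    acc_out = (y_true[~mask] == y_pred[~mask]).mean() # p(f, D \ Psi)
    return acc_in <= acc_out - gap
```

A candidate subset passes only when the model's accuracy inside it trails the accuracy on the rest of the data by at least the chosen gap.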
This incompleteness makes it difficult to identify factors that influence BDM performance or to measure a BDM's recall or false positive rate. It is fundamentally impossible to fix this incompleteness using real data because we cannot enumerate



¹ In past work, "systemic errors" have also been called "failure modes" or "hidden stratification." We introduce "blindspot" to mean the same thing and use it to make it clear when we are specifically discussing blindspot discovery.

