DISCOVERING BUGS IN VISION MODELS USING OFF-THE-SHELF IMAGE GENERATION AND CAPTIONING

Abstract

Automatically discovering failures in vision models under real-world settings remains an open challenge. This work shows how off-the-shelf, large-scale, image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to automatically find such failures. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs given a ground-truth label. A captioning model is then used to describe misclassified inputs, and these descriptions are used in turn to generate more inputs, thereby assessing whether specific descriptions induce more failures than expected. As failures are grounded in natural language, we automatically obtain a high-level, human-interpretable explanation of each failure. We use this pipeline to demonstrate that we can effectively interrogate classifiers trained on IMAGENET to find specific failure cases and discover spurious correlations. We also show that we can scale the approach to generate adversarial datasets targeting specific classifier architectures. This work demonstrates the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner. We also describe a number of limitations and pitfalls related to this approach.

1. INTRODUCTION

Deep learning has enabled breakthroughs in a wide variety of fields (Goodfellow et al., 2016; Krizhevsky et al., 2012; Hinton et al., 2012), and deep neural networks are ubiquitous in many applications, including autonomous driving (Bojarski et al., 2016) and medical imaging (De Fauw et al., 2018). Unfortunately, these models are known to exhibit numerous failures arising from shortcuts and spurious correlations (Geirhos et al., 2020a; Arjovsky et al., 2019; Torralba et al., 2011; Kuehlkamp et al., 2017). As a result, they can fail catastrophically when training and deployment data differ (Buolamwini & Gebru, 2018). Hence, it is important to ensure that models are robust and generalize to new deployment settings. Yet, only a few tools exist to automatically find failure cases on unseen data. Some methods analyze the performance of models by collecting new datasets (usually by scraping the web); these datasets must be large enough to give some indication of how models perform on a particular subset of inputs (Hendrycks et al., 2019; 2020; Recht et al., 2019). Other methods rely on expertly crafted, synthetic (and often unrealistic) datasets that highlight particular shortcomings (Geirhos et al., 2022; Xiao et al., 2020).

In this work, we present a methodology to automatically find failure cases of image classifiers in an open-ended manner, without prior assumptions on the types of failures and how they arise. We leverage off-the-shelf, large-scale, text-to-image generative models, such as DALL·E 2 (Ramesh et al., 2022), IMAGEN (Saharia et al., 2022) or STABLE-DIFFUSION (Rombach et al., 2022), to obtain realistic images that can be reliably manipulated through the text prompt. We also leverage captioning models, such as FLAMINGO (Alayrac et al., 2022) or LEMON (Hu et al., 2021), to retrieve human-interpretable descriptions of each failure case.
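As a minimal, self-contained sketch of this failure-discovery loop, consider the following toy example. Here `generate_images`, `classify` and `caption` are hypothetical stand-ins (not any real library's API) for the text-to-image model, the classifier under test and the captioning model; a spurious correlation with a "snow" background is planted in the toy classifier so the full loop can be shown end to end:

```python
import random

# Toy stand-ins for the off-the-shelf models in the pipeline. A real
# pipeline would call a text-to-image model, the classifier under test,
# and a captioning model; here "images" are plain dicts.

def generate_images(prompt, n):
    """Pretend text-to-image model: 'snow' in the prompt steers generation."""
    rng = random.Random(prompt)  # deterministic per prompt
    if "snow" in prompt:
        backgrounds = ["snow"] * n
    else:
        backgrounds = [rng.choice(["grass", "snow"]) for _ in range(n)]
    return [{"background": b} for b in backgrounds]

def classify(image):
    """Pretend classifier with a planted spurious correlation:
    it fails whenever the background is 'snow'."""
    return "wrong" if image["background"] == "snow" else "correct"

def caption(image):
    """Pretend captioning model: describes an input in natural language."""
    return f"a photo on a {image['background']} background"

def failure_rate(prompt, n=200):
    images = generate_images(prompt, n)
    return sum(classify(im) == "wrong" for im in images) / n

label_prompt = "a photo of a dog"

# Step 1: generate inputs for the ground-truth label and measure the
# classifier's base failure rate.
base = failure_rate(label_prompt)

# Step 2: caption the misclassified inputs to obtain human-interpretable
# hypotheses for why failures occur.
failures = [im for im in generate_images(label_prompt, 200)
            if classify(im) == "wrong"]
hypotheses = {caption(im) for im in failures}

# Step 3: regenerate images from each description and keep those that
# induce more failures than expected, i.e. above the base rate.
confirmed = {h for h in hypotheses
             if failure_rate(f"{label_prompt}, {h}") > base}
```

In this toy setting the loop recovers exactly the planted description, because regenerating from the "snow" caption drives the failure rate above the base rate; with real models, steps 2 and 3 additionally involve clustering the misclassified inputs before captioning.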
Our approach provides the following advantages: (i) generative models trained on web-scale datasets can be re-used and have broad, non-domain-specific coverage; (ii) they demonstrate basic compositionality, can generate novel data and can faithfully capture the essence of (most) prompts, thereby allowing images to be realistically manipulated; (iii) textual descriptions of failures can be easily interpreted (even by non-experts) and interrogated (e.g., by performing counterfactual analyses).

Overall, our contributions are as follows:

• We describe a methodology to discover failures of image classifiers trained on IMAGENET (Deng et al., 2009). In contrast to prior work, we leverage off-the-shelf generative models, thereby avoiding the need to collect new datasets or to rely on manually crafted synthetic images.

• Our approach surfaces failures that are human-interpretable by clustering and captioning inputs on which classifiers fail. These captions can be modified to produce alternative hypotheses of why failures occur, allowing insights into the limitations of a given model.

• We demonstrate the scalability of the approach by generating adversarial datasets (akin to IMAGENET-A; Hendrycks et al., 2019). In contrast to IMAGENET-A, our generated datasets align more closely with the original IMAGENET training distribution and generalize to multiple classifier architectures.

Importantly, while this work focuses on vision models trained on IMAGENET, it is neither limited to IMAGENET nor to the visual domain. It serves as a proof-of-concept demonstrating how large-scale, off-the-shelf, generative models (Bommasani et al., 2021) can be combined to automate the discovery of bugs in machine learning models and produce compelling, interpretable descriptions of model failures. The approach is agnostic to the model architecture, which can be treated as a black box.

2. RELATED WORK

Evaluation datasets. Understanding how model failures arise and empirically analyzing their consequences often requires collecting and annotating new test datasets. Hendrycks et al. (2019) collected datasets of natural adversarial examples (IMAGENET-A and IMAGENET-O) to evaluate how model performance degrades when inputs have limited spurious cues. Hendrycks et al. (2020) collected four real-world datasets (including IMAGENET-R) to understand how models behave under distribution shifts. In many cases, particular shortcomings can only be explored using synthetic datasets (Cimpoi et al., 2013). Hendrycks & Dietterich (2018) introduced IMAGENET-C, a synthetic set of common corruptions. Geirhos et al. (2018) propose to use images with a texture-shape cue conflict to evaluate the propensity of models to over-emphasize texture cues. Xiao et al. (2020); Sagawa et al. (2020) investigate whether models are biased towards background cues by compositing foreground objects with various background images (IMAGENET-9, WATERBIRDS). In all cases, building such datasets is time-consuming and requires expert knowledge.

Automated failure discovery. In some instances, it is possible to distill rules or specifications that constrain the input space enough to enable the automated discovery of failures via optimization or brute-force search. In vision tasks, adversarial examples, which are constructed using ℓp-norm bounded perturbations of the input, can cause neural networks to make incorrect predictions with high confidence (Carlini & Wagner, 2017a;b; Goodfellow et al., 2014; Kurakin et al., 2016; Szegedy et al., 2013). In language tasks, some efforts manually compose templates to generate test cases for specific failures (Jia & Liang, 2017; Garg et al., 2019; Ribeiro et al., 2020). Such approaches rely on human creativity and are intrinsically difficult to scale. Several works (Baluja & Fischer, 2017; Song et al., 2018; Xiao et al., 2018; Qiu et al., 2019; Wong & Kolter, 2021; Laidlaw et al., 2020; Gowal et al., 2019) go beyond hard-coded rules by leveraging generative and perceptual models. However, such approaches are difficult to automate, as it is unclear how to relate specific latent variables to isolated structures of the original input. Finally, we highlight concurrent work (Ge et al., 2022) that leverages captioning and text-to-image models to construct background images to evaluate (and improve) an object detector. Their approach requires compositing the resulting images with foreground objects and is not open-ended, in the sense that it requires a dataset of background images. Perhaps the work by Perez et al. (2022) on red-teaming language models is the most similar to ours.

