DISCOVERING BUGS IN VISION MODELS USING OFF-THE-SHELF IMAGE GENERATION AND CAPTIONING

Abstract

Automatically discovering failures in vision models under real-world settings remains an open challenge. This work shows how off-the-shelf, large-scale, image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to automatically find such failures. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs given a ground-truth label. A captioning model is used to describe misclassified inputs. Descriptions are used in turn to generate more inputs, thereby assessing whether specific descriptions induce more failures than expected. As failures are grounded to natural language, we automatically obtain a high-level, human-interpretable explanation of each failure. We use this pipeline to demonstrate that we can effectively interrogate classifiers trained on IMAGENET to find specific failure cases and discover spurious correlations. We also show that we can scale the approach to generate adversarial datasets targeting specific classifier architectures. This work demonstrates the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner. We also describe a number of limitations and pitfalls related to this approach.

1. INTRODUCTION

Deep learning has enabled breakthroughs in a wide variety of fields (Goodfellow et al., 2016; Krizhevsky et al., 2012; Hinton et al., 2012), and deep neural networks are ubiquitous in many applications, including autonomous driving (Bojarski et al., 2016) and medical imaging (De Fauw et al., 2018). Unfortunately, these models are known to exhibit numerous failures arising from using shortcuts and spurious correlations (Geirhos et al., 2020a; Arjovsky et al., 2019; Torralba et al., 2011; Kuehlkamp et al., 2017). As a result, they can fail catastrophically when training and deployment data differ (Buolamwini & Gebru, 2018). Hence, it is important to ensure that models are robust and generalize to new deployment settings. Yet, only a few tools exist to automatically find failure cases on unseen data. Some methods analyze the performance of models by collecting new datasets (usually by scraping the web). These datasets must be large enough to obtain some indication of how models perform on a particular subset of inputs (Hendrycks et al., 2019; 2020; Recht et al., 2019). Other methods rely on expertly crafted, synthetic (and often unrealistic) datasets that highlight particular shortcomings (Geirhos et al., 2022; Xiao et al., 2020). In this work, we present a methodology to automatically find failure cases of image classifiers in an open-ended manner, without prior assumptions on the types of failures and how they arise. We leverage off-the-shelf, large-scale, text-to-image, generative models, such as DALL•E 2 (Ramesh et al., 2022), IMAGEN (Saharia et al., 2022) or STABLE-DIFFUSION (Rombach et al., 2022), to obtain realistic images that can be reliably manipulated using the text prompt. We also leverage captioning models, such as FLAMINGO (Alayrac et al., 2022) or LEMON (Hu et al., 2021), to retrieve human-interpretable descriptions of each failure case.
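The generate-classify-caption loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the functions `generate_image`, `classify`, and `caption` are hypothetical placeholders standing in for a real text-to-image model, the classifier under test, and a captioning model, respectively; here the stub classifier is hard-wired to fail on "snow" prompts to mimic a spurious correlation.

```python
from collections import Counter

# Hypothetical stand-ins for the off-the-shelf models; in practice these would
# wrap e.g. STABLE-DIFFUSION, an IMAGENET classifier, and FLAMINGO.
def generate_image(prompt):
    # Returns a synthetic "image" keyed by its generating prompt (placeholder).
    return {"prompt": prompt}

def classify(image):
    # Placeholder classifier that fails whenever "snow" appears in the prompt,
    # mimicking a spurious correlation between background and label.
    return "wrong" if "snow" in image["prompt"] else "correct"

def caption(image):
    # Placeholder captioner: echoes the prompt back as the description.
    return image["prompt"]

def find_failure_descriptions(label, modifiers, n_per_prompt=8):
    """Generate images of `label` under several prompt modifiers, caption the
    misclassified ones, and count which descriptions recur most often."""
    failure_counts = Counter()
    for modifier in modifiers:
        prompt = f"a photo of a {label} {modifier}"
        for _ in range(n_per_prompt):
            image = generate_image(prompt)
            if classify(image) != "correct":
                failure_counts[caption(image)] += 1
    return failure_counts
```

Running `find_failure_descriptions("dog", ["on grass", "in snow"])` with these stubs surfaces only the snow description, illustrating how recurring captions of misclassified images localize a failure mode; the recovered descriptions can then be fed back as new prompts to test whether they induce more failures than expected.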
This provides the following advantages: (i) generative models trained on web-scale datasets can be re-used and have broad non-domain-specific coverage; (ii) they demonstrate basic compositionality, can generate novel data and can faithfully capture the essence of (most) prompts, thereby allowing images to be realistically manipulated; (iii) textual

