ORACLE-ORIENTED ROBUSTNESS: ROBUST IMAGE MODEL EVALUATION WITH PRETRAINED MODELS AS SURROGATE ORACLE

Abstract

Machine learning has demonstrated remarkable performance on finite datasets, yet whether scores on fixed benchmarks can sufficiently indicate a model's performance in the real world is still under discussion. In reality, an ideal robust model would probably behave similarly to the oracle (e.g., the human users), so a good evaluation protocol should probably evaluate models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly compares an image classification model's performance with that of a surrogate oracle. In addition, we design a simple method that accomplishes this evaluation beyond the scope of fixed benchmarks. Our method extends image datasets with new samples that are sufficiently perturbed to be distinct from those in the original sets, but are still bounded within the same causal structure that the original test image represents, as constrained by a surrogate oracle model pretrained on a large number of samples. As a result, our method offers a new way to evaluate models' robustness, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the models and of our new evaluation strategies.

1. INTRODUCTION

Machine learning has achieved remarkable performance on various benchmarks. For example, the recent successes of multiple pretrained models (Bommasani et al., 2021; Radford et al., 2021), with the power gained through billions of parameters and samples from the entire internet, have demonstrated human-parallel performance in understanding natural languages (Brown et al., 2020) and even arguably human-surpassing performance in understanding the connections between languages and images (Radford et al., 2021). Even within the scope of fixed benchmarks, machine learning has shown strong numerical evidence that the prediction accuracy on specific tasks can reach positions on the leaderboard as high as a human's (Krizhevsky et al., 2012; He et al., 2015; Nangia & Bowman, 2019), suggesting multiple application scenarios for these methods. However, when deployed in the real world, these methods often underdeliver on the promises made through benchmark datasets (Edwards, 2019; D'Amour et al., 2020), usually because these benchmark datasets, typically i.i.d., cannot sufficiently represent the diversity of samples a model will encounter after being deployed in practice. Fortunately, multiple lines of study have aimed to embrace this challenge, and most of these works propose to further diversify the datasets used at evaluation time.
We notice these works mostly fall into two main categories: (1) works that study performance on testing datasets generated by predefined perturbations of the original i.i.d. datasets, such as adversarial robustness (Szegedy et al., 2013; Goodfellow et al., 2015) or robustness against certain noises (Geirhos et al., 2019; Hendrycks & Dietterich, 2019; Wang et al., 2020b); and (2) works that study performance on testing datasets collected anew with a procedure/distribution different from that of the training sets, such as domain adaptation (Ben-David et al., 2007; 2010) and domain generalization (Muandet et al., 2013; Hendrycks & Dietterich, 2019; Wang et al., 2019; Gulrajani & Lopez-Paz, 2020; Koh et al., 2021; Ye et al., 2021). More details on these lines of work, their advantages and limitations, and how our proposed evaluation protocol contrasts with them are discussed in the next section. In this paper, we investigate how to diversify robustness evaluation datasets so that the evaluation results are credible and representative. As shown in Figure 1, we aim to integrate the advantages of the above two directions by introducing a new protocol that generates evaluation datasets by automatically perturbing samples to be sufficiently different from existing test samples, while maintaining the underlying unknown causal structure with respect to an oracle (we use a CLIP model in this paper). Based on this new evaluation protocol, we introduce a new robustness measurement that directly measures robustness in comparison with the oracle. With our proposed evaluation protocol and metric, we present a study of current robust machine learning techniques to identify the robustness gap between existing models and the oracle. This is particularly important if the goal of a research direction is to produce models that function reliably, with performance comparable to the oracle.
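The generation loop described above can be made concrete with a minimal sketch: repeatedly apply a perturbation, and accept a perturbed candidate only while the surrogate oracle still assigns the original label (i.e., the oracle judges the causal structure intact). The `oracle_label` and `perturb` functions below are hypothetical stand-ins (a real implementation would use a pretrained CLIP model as the oracle and a VQGAN-based perturber, as in the paper); here they are stubbed with NumPy so only the control flow is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_label(image):
    # Hypothetical stand-in for a pretrained surrogate oracle
    # (e.g., CLIP zero-shot classification). Here: a trivial rule
    # on the pixel sum, used only to make the loop runnable.
    return int(image.sum() > 0)

def perturb(image, step=0.5):
    # Hypothetical unbounded perturbation (the paper uses a sparse
    # VQGAN); here: additive Gaussian noise.
    return image + step * rng.normal(size=image.shape)

def generate_eval_sample(image, label, max_steps=10):
    """Push `image` away from the original test sample while the
    surrogate oracle still assigns `label`; oracle-inconsistent
    candidates are rejected."""
    current = image
    for _ in range(max_steps):
        candidate = perturb(current)
        if oracle_label(candidate) == label:
            current = candidate  # accept: still oracle-consistent
        # otherwise reject and draw another perturbation
    return current
```

By construction, every returned sample is still labeled `label` by the oracle, so the generated set can differ substantially from the original test set without (by the oracle's judgment) crossing the class boundary.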
Therefore, our contributions in this paper are three-fold:

• We introduce a new robustness measurement that directly measures the robustness gap between models and the oracle.

• We introduce a new evaluation protocol that generates evaluation datasets by automatically perturbing samples to be sufficiently different from existing test samples, while maintaining the underlying unknown causal structure.

• We leverage our evaluation metric and protocol to offer a study of current robustness research, identifying the robustness gap between existing models and the oracle. Our findings further yield understandings of, and conjectures about, the behaviors of deep learning models.
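One simple way to quantify the model-vs-oracle robustness gap mentioned in the contributions is the agreement rate between the evaluated model and the surrogate oracle on the generated samples. The sketch below is an illustrative formulation under that assumption, not necessarily the paper's exact metric.

```python
def oracle_oriented_robustness(model_preds, oracle_preds):
    """Fraction of generated test samples on which the evaluated
    model agrees with the surrogate oracle's label. A score of 1.0
    means no measurable gap to the oracle on this set; lower scores
    indicate a larger robustness gap.

    Illustrative formulation (an assumption), not the paper's
    exact metric definition.
    """
    if len(model_preds) != len(oracle_preds):
        raise ValueError("prediction lists must be the same length")
    agree = sum(m == o for m, o in zip(model_preds, oracle_preds))
    return agree / len(model_preds)
```

For example, a model that matches the oracle on three of four generated samples scores 0.75 under this formulation.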

2.1. CURRENT ROBUSTNESS EVALUATION PROTOCOLS

The evaluation of machine learning models in non-i.i.d. scenarios has been studied for more than a decade, and one of the pioneering lines of work is probably domain adaptation (Ben-David et al., 2010). In



Figure 1: (a) The main structure of our system for generating test images with a surrogate oracle, with examples of the generated images and their effectiveness in evaluating model robustness. (b) The sparse VQGAN we use to introduce unbounded perturbations. (c) The sparse feature selection method used to sparsify VQGAN.

