ORACLE-ORIENTED ROBUSTNESS: ROBUST IMAGE MODEL EVALUATION WITH PRETRAINED MODELS AS SURROGATE ORACLE

Abstract

Machine learning has demonstrated remarkable performance over finite datasets, yet whether scores over fixed benchmarks can sufficiently indicate a model's performance in the real world remains under discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users); thus, a good evaluation protocol is probably to evaluate the model's behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle. In addition, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same causal structure the original test image represents, constrained by a surrogate oracle model pretrained with a large number of samples. As a result, our new method offers a new way to evaluate models' robustness, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.
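The evaluation idea described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the function names (`oracle_constrained_perturb`, `oracle_predict`, `perturb`) and the simple accept/reject loop are assumptions introduced here purely to make the concept concrete: a test image is repeatedly perturbed, and a candidate perturbation is kept only if a pretrained surrogate oracle still assigns it the original label, so the generated sample stays within the class the original image represents.

```python
import numpy as np

def oracle_constrained_perturb(image, oracle_predict, perturb, n_steps=10, rng=None):
    """Illustrative sketch: perturb an image while a surrogate oracle
    still predicts the original label, so the result is distinct from
    the original sample but remains in the same class."""
    rng = rng if rng is not None else np.random.default_rng(0)
    original_label = oracle_predict(image)
    current = image
    for _ in range(n_steps):
        candidate = perturb(current, rng)
        # Accept the perturbation only if the oracle's prediction is unchanged.
        if oracle_predict(candidate) == original_label:
            current = candidate
    return current
```

A toy usage, with a trivial stand-in oracle (the mean-sign of the pixels) in place of a large pretrained model, shows the invariant: the returned image may differ from the input, but the oracle's label is preserved by construction.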

1. INTRODUCTION

Machine learning has achieved remarkable performance over various benchmarks. For example, the recent successes of multiple pretrained models (Bommasani et al., 2021; Radford et al., 2021), with the power gained through billions of parameters and samples from the entire internet, have demonstrated human-parallel performance in understanding natural languages (Brown et al., 2020) or even arguably human-surpassing performance in understanding the connections between languages and images (Radford et al., 2021). Even within the scope of fixed benchmarks, machine learning has shown strong numerical evidence that the prediction accuracy over specific tasks can reach leaderboard positions as high as a human's (Krizhevsky et al., 2012; He et al., 2015; Nangia & Bowman, 2019), suggesting multiple application scenarios for these methods. However, these methods deployed in the real world often underdeliver the promises made through the benchmark datasets (Edwards, 2019; D'Amour et al., 2020), usually because these benchmark datasets, typically i.i.d., cannot sufficiently represent the diversity of the samples a model will encounter after being deployed in practice. Fortunately, multiple lines of study have aimed to embrace this challenge, and most of these works propose to further diversify the datasets used at evaluation time.
We notice these works mostly fall into two main categories: (1) works that study the performance over testing datasets generated by predefined perturbations of the original i.i.d. datasets, such as adversarial robustness (Szegedy et al., 2013; Goodfellow et al., 2015) or robustness against certain noises (Geirhos et al., 2019; Hendrycks & Dietterich, 2019; Wang et al., 2020b); and (2) works that study the performance over testing datasets collected anew with a procedure/distribution different from the one used for the training sets, such as domain adaptation (Ben-David et al., 2007; 2010) and domain generalization (Muandet et al., 2013).

