HOW HARD ARE COMPUTER VISION DATASETS? CALIBRATING DATASET DIFFICULTY TO VIEWING TIME

Abstract

Humans outperform object recognizers despite the fact that models perform well on current datasets. Numerous attempts have been made to create more challenging datasets by scaling them up from the web, exploring distribution shift, or adding controls for biases, yet the difficulty of individual images is not independently evaluated, nor is the concept of dataset difficulty as a whole well posed. We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object: images whose objects can be recognized in 17 ms are easier than those that require seconds of viewing time. Using 133,588 judgments on two major datasets, ImageNet and ObjectNet, we determine the distribution of image difficulties in those datasets, finding that it varies widely but significantly undersamples hard images. Rather than hoping that distribution shift or other approaches will incidentally yield hard datasets, we should measure the difficulty of datasets and explicitly fill out the class of difficult examples. Analyzing model performance as a function of image difficulty reveals that models tend to have lower accuracy and a larger generalization gap on harder images. Encouragingly for the biological validity of current architectures, much of the variance in human difficulty can be accounted for, given an object recognizer, by a combination of prediction depth, c-score, and adversarial robustness. We release a dataset of such judgments as a complementary metric to raw performance and to a network's ability to explain neural recordings. These experiments with humans allow us to create a metric for progress in object recognition datasets, which we find are skewed toward easy examples; to test the biological validity of models in a novel way; and to develop tools for shaping datasets as they are gathered, focusing them on filling out the missing class of hard examples in today's datasets.

1. INTRODUCTION

Numerous efforts exist to build better evaluations for object recognizers. Broadly, these fall into four categories: those that probe distribution shift, like ImageNetV2 (Recht et al., 2019); those that add scale, like OpenImages (Kuznetsova et al., 2020); those that explicitly attempt to make images more difficult for models, either by adversarially selecting them, like ImageNet-A (Hendrycks et al., 2021), or by adding artificial corruptions, like ImageNet-C (Hendrycks & Dietterich, 2019); and those that explicitly attempt to control for biases, like ObjectNet (Barbu et al., 2019). These are responses to the fact that performance on standard benchmarks does not translate well to real-world conditions; 90% accuracy for one class in ImageNet does not mean that the detector will achieve 90% accuracy for that class in one's home or on frames of a movie. In all four cases, these efforts have no objective guide, no metric that evaluates how far they have progressed toward enabling models to generalize. We set out to measure an orthogonal quantity: how difficult images in these datasets are for humans. Distribution shift and bias control will not on their own address this problem if datasets are overwhelmingly easy compared to what humans are capable of recognizing. And while scale helps, if datasets are heavily skewed toward images that are easy for humans, the statistics of performance on such datasets may hide the real underlying performance trends of models on harder images.
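The viewing-time metric can be sketched as follows: each image is shown to viewers at a range of presentation times, and its difficulty is the shortest exposure at which a majority of viewers correctly classify the target object. The data layout, exposure grid, majority threshold, and function names below are illustrative assumptions, not the paper's actual experimental protocol.

```python
from collections import defaultdict

# Candidate presentation times in milliseconds, spanning the range from
# very brief exposures (17 ms) up to several seconds (assumed grid).
EXPOSURES_MS = [17, 50, 150, 500, 2000, 10000]

def image_difficulty(judgments, majority=0.5):
    """Return the minimal exposure (ms) at which the fraction of correct
    classifications exceeds `majority`, or None if it is never reached.

    `judgments` is a list of (exposure_ms, correct) pairs pooled across
    viewers for a single image.
    """
    by_exposure = defaultdict(list)
    for exposure_ms, correct in judgments:
        by_exposure[exposure_ms].append(correct)
    for exposure_ms in sorted(by_exposure):
        votes = by_exposure[exposure_ms]
        if sum(votes) / len(votes) > majority:
            return exposure_ms
    return None

# Example: an image recognized reliably only at 150 ms and beyond,
# which under this scheme is harder than one recognized at 17 ms.
judgments = [(17, False), (17, False), (50, True), (50, False),
             (150, True), (150, True), (500, True)]
difficulty = image_difficulty(judgments)  # 150
```

Under this scheme, a dataset's difficulty distribution is simply the histogram of per-image difficulties, which makes the undersampling of hard (long-viewing-time) images directly visible.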

Availability

Dataset and analysis code can be found at https://github.com/image-flash/

