DECONSTRUCTING DISTRIBUTIONS: A POINTWISE FRAMEWORK OF LEARNING

Abstract

In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a single input point. Specifically, we study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, both in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even negative correlation: cases where improving overall model accuracy actually hurts performance on these inputs. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is negatively correlated with accuracy on the CIFAR-10 test set. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller et al., 2021).

1. INTRODUCTION

A central question in machine learning is: what are the machines learning? ML practitioners produce models with surprisingly good performance on inputs outside of their training distribution, exhibiting new and unexpected kinds of learning such as mathematical problem solving, code generation, and unanticipated forms of robustness.¹ However, current formal performance measures are limited, and do not allow us to reason about or even fully describe these interesting settings. When measuring human learning, we do not merely assess a student by their final grade on an exam. Instead, we also look at performance on individual questions, which assess different skills, and we consider the student's improvement over time, to form a richer picture of their learning progress. In contrast, when measuring the performance of a learning algorithm, we typically collapse its performance to a single number. That is, existing tools from learning theory and statistics mainly consider a single model (or a single distribution over models), and measure its average performance on a single test distribution (Shalev-Shwartz & Ben-David, 2014; Tsybakov, 2009; Valiant, 1984). Such a coarse measurement fails to capture rich aspects of learning. For example, there are many different functions which achieve 75% test accuracy on ImageNet, but it is crucial to understand which of these functions we actually obtain when training real models. Some functions with 75% overall accuracy may fail catastrophically on certain subgroups of inputs (Buolamwini & Gebru, 2018; Koenecke et al., 2020; Hooker et al., 2019); yet other functions may fail catastrophically on "out-of-distribution" inputs. The research program of understanding models as functions, and not just via single scalars, has been developed recently (e.g., in Nakkiran & Bansal (2022)), and we push this program further in our work.

Figure 1: Learning Profiles. We consider the "input points vs. models" matrix of accuracies (i.e., probabilities of correct classification), with rows corresponding to inputs and columns corresponding to models from some parameterized family, sorted according to their global accuracy. A 70%-accurate model is on average more successful than a 30%-accurate one, but there are points on which it does worse. In this case, the softmax probabilities of the bottom image show that only higher-accuracy models recognize the existence of the soccer ball, throwing them off the "Dalmatian" label. Label noise or ambiguity is the reason behind some, but not all, such "accuracy non-monotonicities".

Figure 1 illustrates our approach. Instead of averaging performance over a distribution of inputs, we take a "distribution-free" approach and consider pointwise performance on one input at a time. For each input point z, we consider the performance of a collection of models on z as a function of increasing resources (e.g., training time, training set size, or model size). While more-resourced models have higher global accuracy, the accuracy profile for a single point z, i.e., the row corresponding to z in the points-vs-models matrix, is not always monotonically increasing. That is, models with higher overall test accuracy can perform worse on certain test points. The pointwise accuracy also sometimes increases faster (for easier points) or slower (for harder ones) than the global accuracy. We also consider the full softmax profile of a point z, represented by the stackplot at the bottom of the figure, which depicts the softmax probabilities induced on z by this family of models. Using the softmax profile we can identify different types of points, including those with non-monotone accuracy due to label ambiguity (as in the figure), and points with non-monotone softmax entropy, for which model certainty decreases with increased resources. And since our framework is "distribution free," it applies equally well to learning on both in-distribution and out-of-distribution inputs.
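To make the points-vs-models matrix concrete, the following is a minimal sketch with synthetic data (all names and numbers here are illustrative, not from the paper): entry acc[i, j] is the pointwise accuracy of model j on test input i, i.e., the probability of correct classification, estimated over several training seeds. A point's profile is simply its row, read against the models' global accuracies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_models, n_seeds = 1000, 20, 10

# Simulate a family of models whose global accuracy grows from ~30% to ~90%
# with increasing resources (training time, model size, etc.).
skill = np.linspace(0.3, 0.9, n_models)
correct = rng.random((n_points, n_models, n_seeds)) < skill[None, :, None]
acc = correct.mean(axis=2)            # points x models matrix of accuracies

# Sort columns by global (average) accuracy, as in Figure 1.
global_acc = acc.mean(axis=0)         # one number per model
order = np.argsort(global_acc)
acc, global_acc = acc[:, order], global_acc[order]

# The profile of point z is row z: pointwise accuracy as a function of
# the models' global accuracy.
z = 0
profile = acc[z]

# A point whose profile is negatively correlated with global accuracy is a
# candidate "non-monotone" point.
corr = np.corrcoef(profile, global_acc)[0, 1]
print(f"global accuracy range: {global_acc[0]:.2f} -> {global_acc[-1]:.2f}")
print(f"profile/global correlation for point {z}: {corr:.2f}")
```

In this synthetic setup each point improves with global accuracy in expectation; the interesting empirical finding of the paper is that on real datasets, some rows do not.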

1.1. OUR CONTRIBUTIONS

In this paper, we initiate a systematic study of pointwise performance in ML (see Figure 1). We show that such pointwise analysis is useful both as a conceptual way to reason about learning, and as a practical tool for revealing structure in learned models and datasets.

Framework: Definition of learning profiles (Section 2.1). We introduce a mathematical object capturing pointwise performance: the "profile" of a point z with respect to a parameterized family of classifiers T and a test distribution D (see Section 2.1). Roughly speaking, a profile is the formalism of Figure 1, i.e., a mapping from the global accuracy of classifiers to their performance on an individual point.

Taxonomy of points (Section 3). Profiles allow deconstructing popular datasets such as CIFAR-10, CINIC-10, ImageNet, and ImageNet-R into points that display qualitatively distinct behavior (see Figures 3 and 4). For example, for compatible points the pointwise accuracy closely tracks the global accuracy, whereas for non-monotone points the pointwise accuracy can be negatively correlated with the global accuracy. We show that a significant fraction of standard datasets displays noticeable non-monotonicity, which is fairly insensitive to the choice of architecture.

Pretrained vs. End-to-End Methods (Section 3.2). Our pointwise measures reveal stark differences between pre-trained and randomly initialized classifiers, even when they share not just identical architectures but also identical global accuracy. In particular, we see that for pre-trained classifiers
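The compatible/non-monotone split above can be sketched in code. This is an illustrative toy procedure, not the paper's exact construction: label each point by the correlation between its profile and the models' global accuracy, then collect the negatively correlated points into a "NEG"-style subset, in the spirit of CIFAR-10-NEG. The data, thresholds, and the 10% flip rate are all hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_models, n_seeds = 500, 30, 25
global_acc = np.linspace(0.2, 0.95, n_models)

# Synthetic matrix: most points get easier as models improve ("compatible"),
# while a small fraction gets harder ("non-monotone").
flip = rng.random(n_points) < 0.1
p_correct = np.where(flip[:, None], 1.0 - global_acc, global_acc)
acc = (rng.random((n_points, n_models, n_seeds))
       < p_correct[..., None]).mean(axis=-1)

# Pearson correlation of each row (profile) with the global accuracies.
centered = acc - acc.mean(axis=1, keepdims=True)
g = global_acc - global_acc.mean()
corr = centered @ g / (np.linalg.norm(centered, axis=1) * np.linalg.norm(g))

compatible = corr > 0.5     # pointwise accuracy tracks global accuracy
non_monotone = corr < 0.0   # improving the models hurts these points
neg_subset = np.flatnonzero(non_monotone)   # a "NEG"-style subset

print(f"{compatible.mean():.0%} compatible, "
      f"{non_monotone.mean():.0%} non-monotone")
```

By construction, evaluating the model family on `neg_subset` alone yields accuracies that move opposite to global accuracy, which is the effect CIFAR-10-NEG exhibits on real models.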



¹ For example, Devlin et al. (2018); Brown et al. (2020); Radford et al. (2021); Hendrycks et al. (2021a; 2020a; 2021c).




