DECONSTRUCTING DISTRIBUTIONS: A POINTWISE FRAMEWORK OF LEARNING

Abstract

In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a single input point. Specifically, we study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On the one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even negative correlation: cases where improving overall model accuracy actually hurts performance on these inputs. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is negatively correlated with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller et al., 2021).

1. INTRODUCTION

A central question in machine learning is: what are the machines learning? ML practitioners produce models with surprisingly good performance on inputs outside of their training distribution, exhibiting new and unexpected kinds of learning such as mathematical problem solving, code generation, and unanticipated forms of robustness.¹ However, current formal performance measures are limited, and do not allow us to reason about or even fully describe these interesting settings.

When measuring human learning using an exam, we do not merely assess a student by looking at their final grade. Instead, we also look at performance on individual questions, which can assess different skills, and we consider the student's improvement over time, to see a richer picture of their learning progress. In contrast, when measuring the performance of a learning algorithm, we typically collapse the measurement of its performance to just a single number. That is, existing tools from learning theory and statistics mainly consider a single model (or a single distribution over models), and measure its average performance on a single test distribution (Shalev-Shwartz & Ben-David, 2014; Tsybakov, 2009; Valiant, 1984). Such a coarse measurement fails to capture rich aspects of learning. For example, there are many different functions which achieve 75% test accuracy on ImageNet, but it is crucial to understand which of these functions we actually obtain when training real models. Some functions with 75% overall accuracy may fail catastrophically on certain subgroups of inputs (Buolamwini & Gebru, 2018; Koenecke et al., 2020;
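The pointwise view described in the abstract can be made concrete with a small sketch. Assuming we have a 0/1 correctness matrix for a collection of models on a set of test points, a point's profile (in its simplest linear summary) is the correlation, across models, between each model's average accuracy and its correctness on that individual point. The helper below is a hypothetical illustration of this quantity, not the paper's exact procedure:

```python
# Hypothetical sketch of a point "profile": for a collection of models,
# correlate each model's average test accuracy with its correctness on
# one individual point. (Illustrative only; not the paper's exact method.)
import numpy as np

def point_profiles(correct):
    """correct: (n_models, n_points) 0/1 matrix; correct[m, i] = 1 iff
    model m classifies point i correctly.
    Returns, for each point, the Pearson correlation across models between
    a model's average accuracy and its correctness on that point."""
    correct = np.asarray(correct, dtype=float)
    avg_acc = correct.mean(axis=1)             # per-model average accuracy
    a = avg_acc - avg_acc.mean()               # center the averages
    p = correct - correct.mean(axis=0)         # center each point's column
    denom = np.linalg.norm(a) * np.linalg.norm(p, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return (a @ p) / denom                 # NaN for zero-variance points

# Four synthetic models of increasing strength on 8 "filler" points, plus
# a "compatible" point (stronger models get it right) and a "negative"
# point (stronger models get it wrong).
filler = np.array([[1] * k + [0] * (8 - k) for k in (2, 4, 6, 8)])
compatible = np.array([[0], [0], [1], [1]])
negative = np.array([[1], [1], [0], [0]])
profiles = point_profiles(np.hstack([filler, compatible, negative]))
print(round(profiles[8], 3), round(profiles[9], 3))  # ~0.894, ~-0.894
```

On this toy data, the compatible point's correctness rises with overall model accuracy (positive profile), while the negative point inverts that relationship, mirroring the two qualitative regimes the paper identifies.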

¹ For example, Devlin et al. (2018); Brown et al. (2020); Radford et al. (2021); Hendrycks et al. (2021a; 2020a; 2021c).

