CHARACTERIZING STRUCTURAL REGULARITIES OF LABELED DATA IN OVERPARAMETERIZED MODELS

Abstract

Humans are accustomed to environments that contain both regularities and exceptions. For example, at most gas stations, one pays prior to pumping, but the occasional rural station does not accept payment in advance. Likewise, deep neural networks can generalize across instances that share common patterns or structures, yet have the capacity to memorize rare or irregular forms. We analyze how individual instances are treated by a model via a consistency score. The score characterizes the expected accuracy for a held-out instance given training sets of varying size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and strongly regular examples at the other end. We identify computationally inexpensive proxies to the consistency score using statistics collected during training. We apply the score toward understanding the dynamics of representation learning and toward filtering outliers during training.

1. INTRODUCTION

Human learning requires both inferring regular patterns that generalize across many distinct examples and memorizing irregular examples. The boundary between regular and irregular examples can be fuzzy. For example, in learning the past-tense form of English verbs, there are some verbs whose past tenses must simply be memorized (GO→WENT, EAT→ATE, HIT→HIT) and there are many regular verbs that obey the rule of appending "ed" (KISS→KISSED, KICK→KICKED, BREW→BREWED, etc.). Generalization to a novel word typically follows the "ed" rule, for example, BINK→BINKED. Intermediate between the exception verbs and regular verbs are subregularities: sets of exception verbs that share a consistent structure (e.g., the mapping of SING→SANG, RING→RANG). Note that rule-governed and exception cases can have very similar forms, which increases the difficulty of learning each. Consider one-syllable verbs containing 'ee', which include the regular case NEED→NEEDED as well as exception cases like SEEK→SOUGHT. Generalization from the rule-governed cases can hamper the learning of the exception cases, and vice versa. For instance, children learning English over-regularize by mapping GO→GOED early in the course of language learning. Neural nets show the same interesting pattern for verbs over the course of training (Rumelhart & McClelland, 1986). Memorizing irregular examples is tantamount to building a look-up table with the individual facts accessible for retrieval. Generalization requires the inference of statistical regularities in the training environment, and the application of procedures or rules for exploiting those regularities. In deep learning, memorization is often considered a failure of a network because memorization implies no generalization. However, mastering a domain involves knowing when to generalize and when not to generalize, because data manifolds are rarely unimodal.
Consider the two-class problem of chair vs. non-chair with training examples illustrated in Figure 1a. The iron throne (lower left) forms a sparsely populated mode (sparse mode for short), as there may not exist many similar cases in the data environment. Generic chairs (lower right) lie in a region with a consistent labeling (a densely populated mode, or dense mode) and thus seem to follow a strong regularity. But many other cases fill the continuum between these two extremes. For example, the rocking chair (upper right) has a few supporting neighbors but lies in a distinct neighborhood from the majority of same-label instances (the generic chairs).

To formalize these intuitions, we define the consistency profile of an example (x, y) as the expected accuracy of a model f trained on a data set of size n sampled from the underlying distribution P:

C_{P,n}(x, y) = E_{D^n ∼ P} [ P( f(x; D\{(x, y)}) = y ) ],  n = 0, 1, . . .  (1)

This quantity measures the per-instance generalization on (x, y). For a fixed n, it characterizes how consistent (x, y) is with a sample D from P. Formally, it is closely tied to the fundamental notion of generalization performance when we take an expectation over (x, y). Without the expectation, the quantity gives us a fine-grained characterization of the regularity of each individual example. This article focuses on classification problems, but the definition can easily be extended to other problems by replacing the 0-1 classification loss with another suitable loss function. The quantity C_{P,n}(x, y) has an interpretation that matches our high-level intuition about the structural regularities of the training data during (human or machine) learning. In particular, we can characterize the multimodal structure of an underlying data distribution by grouping examples in terms of a model's generalization profile for those examples when trained on data sets of increasing size. For n = 0, the model makes predictions entirely based on its prior belief. As n increases, the model collects more information about P and makes better predictions.
For an (x, y) instance belonging to a dense mode (e.g., the generic chairs in Figure 1a), the model prediction is accurate even for small n, because even small samples have many class-consistent neighbors. The blue curve in the cartoon sketch of Figure 1b illustrates this profile. For instances belonging to sparse modes (e.g., the iron throne in Figure 1a), the prediction will be inaccurate even for large n, as the red curve illustrates. Most instances fill the continuum between these two extreme cases, as illustrated by the purple curves in Figure 1b. To obtain a total ordering over all examples, we pool the consistency profile into a scalar consistency score, or C-score, by taking an expectation over n. Figure 1c shows examples from the ImageNet data set ranked by estimated C-scores, using a methodology we shortly describe. The images show that many ImageNet classes contain dense modes of center-cropped, close-up shots of representative examples; at the other end of the C-score ranking are sparse modes of highly ambiguous examples (in many cases, the object is barely visible or can only be inferred from the context of the picture). With strong ties to both theoretical notions of learning and human intuition, the consistency profile is an important tool for understanding the regularity and subregularity structures of training data sets and the learning dynamics of models trained on those data. The C-score-based ranking also has many potential uses, such as detecting out-of-distribution and mislabeled instances; balancing learning between dense and sparse modes to ensure fairness when learning with data from underrepresented groups; or serving as a diagnostic to determine training priority in a curriculum learning setting (Bengio et al., 2009; Saxena et al., 2019).
In this article, we focus on formulating and analyzing consistency profiles, and apply the C-score to analyzing the structure of real-world image data sets and the learning dynamics of different optimizers. We also study efficient proxies and further applications to outlier detection.

2. RELATED WORK

Analyzing the structure of data sets has been a central topic in fields like statistics, data mining, and unsupervised learning. In this paper, we focus on supervised learning and the interplay between the regularity structure of data and overparameterized neural network learners. This differentiates our work from classical analyses based on input or (unsupervised) latent representations. The distinction is especially prominent in deep learning, where a supervised learner jointly learns the classifier and the representation that captures the semantic information in the labels. In the context of deep supervised learning, Carlini et al. (2018) proposed measures for identifying prototypical examples that could serve as a proxy for the complete data set and still achieve good performance. These examples are not necessarily the center of a dense neighborhood, which is what our high C-score measures. Two prototype measures explored in Carlini et al. (2018), model confidence and learning speed, are also measures we examine. Their holdout retraining and ensemble agreement metrics are conceptually similar to our C-score estimation algorithm. However, their retraining is a two-stage procedure involving pre-training and fine-tuning, and their ensemble agreement mixes architectures with heterogeneous capacities and ignores labels. Feldman (2020) and Feldman & Zhang (2020) studied the positive effects of memorization on generalization by measuring the influence of a training example on a test example and identifying pairs with strong influences. To quantify memorization, they defined a memorization score for each (x, y) in a training set as the drop in prediction accuracy on x when (x, y) is removed. A point evaluation of our consistency profile at a fixed data size n resembles the second term of their score. A key difference is that we are interested in the profile with increasing n, i.e., the sample complexity required to correctly predict (x, y).
We evaluate various cheap-to-compute proxies for the C-score and find that learning speed has a strong correlation with the C-score. Learning speed has previously been studied in contexts quite different from our focus on the generalization of individual examples. Mangalam & Prabhu (2019) show that examples learned first are those that could be learned by shallower nets.

3. THE CONSISTENCY PROFILE AND THE C-SCORE

The consistency profile (Equation 1) encodes the structural consistency of an example with the underlying data distribution P via the expected performance of models trained with increasingly large data sets sampled from P. However, it is not possible to directly compute this profile because P is generally unknown for typical learning problems. In practice, we usually have a fixed data set D consisting of N i.i.d. samples from P. So we can estimate the consistency profile with the following empirical consistency profile:

Ĉ_{D,n}(x, y) = Ê^r_{D̂ ∼ D} [ P( f(x; D̂) = y ) ],  n = 0, 1, . . . , N − 1  (2)

where D̂ is a subset of size n uniformly sampled from D excluding (x, y), and Ê^r denotes empirical averaging with r i.i.d. samples of such subsets. Even for a reasonably accurate estimate (say, r = 1000), calculating the empirical consistency profile is computationally prohibitive: for each of the 50,000 training examples in the CIFAR-10 training set, we would need to train more than 2 trillion models. To obtain an estimate within the capability of current computational resources, we make two observations. First, model performance is generally stable when the training set size varies within a small range. Therefore, we can sample across the range of n that we are concerned with and obtain the full profile via smooth interpolation. Second, if D̂ is a random subset of the training data, then the single model f(·; D̂) can be reused in the estimation for all of the held-out examples (x, y) ∈ D\D̂. As a result, with clever grouping and reuse, the number of models we need to train can be greatly reduced (see Algorithm 1 in the Appendix). In particular, we sample n dynamically according to the subset ratio s ∈ {10%, . . . , 90%} of the full available training set. We sample 2,000 subsets for the empirical expectation at each n and visualize the estimated consistency profiles for clusters of similar examples in Figure 2.
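The grouping-and-reuse idea above can be sketched in a few lines of numpy. This is an illustrative toy version, not the paper's code; `train_fn` is a hypothetical callable that returns a fitted model exposing a `.predict` method.

```python
import numpy as np

def estimate_cscores(X, y, train_fn, k, subset_ratio, seed=0):
    """Illustrative sketch of the holdout estimator: train k models on
    random subsets of a fixed ratio, then score each example by its
    average 0-1 accuracy over the runs in which it was held out."""
    rng = np.random.default_rng(seed)
    N = len(y)
    n = int(subset_ratio * N)
    mask = np.zeros((k, N), dtype=bool)      # mask[i, j]: example j in subset i
    correct = np.zeros((k, N), dtype=bool)   # correct[i, j]: model i classifies j correctly
    for i in range(k):
        idx = rng.choice(N, size=n, replace=False)
        mask[i, idx] = True
        model = train_fn(X[idx], y[idx])     # train from scratch on the subset
        correct[i] = model.predict(X) == y   # reuse one model for all held-out examples
    held_out = ~mask
    return (held_out & correct).sum(0) / held_out.sum(0)
```

Each trained model contributes a held-out evaluation for every example outside its subset, which is exactly the reuse that makes the estimate affordable.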
One interesting observation is that while CIFAR-100 is generally more difficult than CIFAR-10, the top-ranked examples (magenta lines) in CIFAR-100 are more likely to be classified correctly when the subset ratio is low. Figure 3a visualizes the top-ranked examples from the two data sets. Note that in CIFAR-10, the dense modes from the truck and automobile classes are quite similar. In contrast, Figure 2 indicates that the bottom-ranked examples (cyan lines) have persistently low probability of correct classification (sometimes below chance) even with a 90% subset ratio. We visualize some bottom-ranked examples and annotate them as (possibly) mislabeled, ambiguous (easily confused with another class or hard to identify the contents), and atypical form (e.g., burning "forest", fallen "bottle"). As the subset ratio grows, regularities in the data distribution systematically pull the ambiguous instances in the wrong direction. This behavior is analogous to the phenomenon we mentioned earlier that children over-regularize verbs (GO→GOED) as they gain more exposure to a language. To get a total ordering of the examples in a data set, we distill the consistency profiles into a scalar consistency score, or C-score, by taking the expectation over n (Equation 3). For the case where n is sampled according to the subset ratio s, the expectation is taken over a uniform distribution over the sampled subset sizes.
Ĉ_D(x, y) = E_n [ Ĉ_{D,n}(x, y) ]  (3)
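As a minimal illustration of this pooling step: with a uniform distribution over the sampled subset ratios, the C-score is simply the mean of the empirical profile values. The profile values below are hypothetical.

```python
import numpy as np

def c_score(profile):
    """Pool an empirical consistency profile into a scalar C-score by
    averaging over subset ratios (uniform weighting over sampled n)."""
    # profile: mapping from subset ratio s to the estimated value of Ĉ_{D,n}(x, y)
    return float(np.mean(list(profile.values())))
```

For example, an instance correct 20% of the time at s = 10%, 60% at s = 50%, and 100% at s = 90% receives a C-score of 0.6.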

4. THE STRUCTURAL REGULARITIES OF COMMON IMAGE DATA SETS

We apply the C-score estimate to analyze several common image data sets: MNIST, CIFAR-10, CIFAR-100, and ImageNet. See Appendix A for details on architectures and hyperparameters. Figure 4a shows the distribution of Ĉ_{D,n} on CIFAR-10 for the values of n corresponding to each subset ratio s ∈ {10, ..., 90}. For each s, 2000 models are trained and held-out examples are evaluated. The figure suggests that depending on s, instances may be concentrated near floor or ceiling, making them difficult to distinguish (as we elaborate further shortly). By taking an expectation over s, the C-score is less susceptible to floor and ceiling effects. Figure 4b shows the histogram of this integrated C-score on MNIST, CIFAR-10, and CIFAR-100. The histogram for CIFAR-10 in Figure 4b is distributed toward the high end, but is more uniformly spread than the histograms for specific subset ratios in Figure 4a. For MNIST, CIFAR-10, and CIFAR-100, Figure 5 shows examples with high, middle, and low C-scores from representative classes.

Next we apply the C-score analysis to the ImageNet data set. Training a standard model on ImageNet costs one to two orders of magnitude more computing resources than training on CIFAR, preventing us from running the C-score estimation procedure described earlier. Instead, we investigated the feasibility of approximating the C-score with a point estimate, i.e., selection of the s that best represents the integral score. This is equivalent to taking the expectation over s with respect to a point-mass distribution, as opposed to the uniform distribution over subset ratios. By 'best represents,' we mean that the ranking of instances by the point score matches the ranking by the integral score. Figure 6a shows the rank correlation between the integral score and the score for a given s, as a function of s, for our three smaller data sets: MNIST, CIFAR-10, and CIFAR-100. Examining the green CIFAR-10 curve, there is a peak at s = 30, indicating that s = 30 yields the best point-estimate approximation for the integral C-score.
That the peak is at an intermediate s is consistent with the observation from Figure 2 that instances bunch together at low and high s. For MNIST (blue curve), a less challenging data set than CIFAR-10, the peak is lower, at s = 10; for CIFAR-100 (orange curve), a more challenging data set than CIFAR-10, the peak is higher, at s = 40.
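The point-estimate selection above can be sketched as follows. This is an illustrative reconstruction of the Figure 6a analysis, not the paper's code; the Spearman correlation is computed from average ranks with numpy only.

```python
import numpy as np

def _avg_ranks(a):
    """Average ranks; tied values share their mean rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)
    ranks = np.empty(len(a))
    ranks[order] = np.arange(len(a), dtype=float)
    for v in np.unique(a):
        tied = a == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return float(np.corrcoef(_avg_ranks(x), _avg_ranks(y))[0, 1])

def best_point_estimate(profiles, ratios):
    """Pick the subset ratio whose per-ratio score ranking best matches the
    integral C-score ranking. `profiles`: (num_examples, num_ratios) matrix."""
    integral = profiles.mean(axis=1)  # integral C-score per example
    corrs = [spearman(profiles[:, j], integral) for j in range(profiles.shape[1])]
    return ratios[int(np.argmax(corrs))], corrs
```

Scores that saturate near floor or ceiling at extreme ratios produce many ties, lowering their rank correlation with the integral score, which is why the best single s tends to sit in the middle.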

6. C-SCORE PROXIES

We are able to reduce the cost of estimating C-scores from infeasible to feasible, but the procedure is still very expensive. Ideally, we would like more efficient proxies that do not require training multiple models. We use the term proxy to refer to any quantity that is well correlated with the C-score but does not have a direct mathematical relation to it, in contrast with approximations that are designed to mathematically approximate the C-score (e.g., approximating the expectation with empirical averaging). The candidate set for C-score proxies is very large, as any measure that reflects information about the difficulty or regularity of examples could be considered. Our Related Work section mentions a few such possibilities. We examined a collection of proxies based on inter-example distances in input and latent spaces, but none were particularly promising (see Appendix for details). Inspired by our observations in the previous section that speed of learning tends to correlate with the C-score rankings, we instead focus on a class of learning-speed-based proxies that have the added bonus of being trivial to compute. Intuitively, a training example that is consistent with many others should be learned quickly because the gradient steps for all consistent examples should be well aligned. One might therefore conjecture that strong regularities in a data set are not only better learned at asymptote, leading to better generalization performance, but are also learned sooner in the time course of training. This learning-speed hypothesis is nontrivial, because the C-score is defined for a held-out instance following training, whereas learning speed is defined for a training instance during training. The hypothesis is qualitatively verified in Figure 8. In particular, the cyan examples having the lowest C-scores are learned most slowly, and the purple examples having the highest C-scores are learned most quickly.
Indeed, learning speed is monotonically related to C-score bin. Figure 9b shows a quantitative evaluation, where we compute Spearman's rank correlation between the C-score of an instance and various proxy scores based on learning speed. In particular, we test accuracy (0-1 correctness), p_L (softmax confidence on the correct class), p_max (max softmax confidence across all classes), and entropy (negative entropy of softmax confidences). We use cumulative statistics that average from the beginning of training to the current epoch, because the cumulative statistics yield a more stable measure, and higher correlation, than statistics based on a single epoch. We also compare to a forgetting-event statistic (Toneva et al., 2019), which is simply a count of the number of transitions from "learned" to "forgotten" during training. All of these proxies show strong correlation with the C-score. p_L reaches ρ ≈ 0.9 at its peak. p_max and entropy perform similarly, both slightly worse than p_L. The plot also shows that examples with low C-scores are more likely to be forgotten during training. The forgetting-event-based proxy slightly underperforms the other proxies and takes a larger number of training epochs to reach its peak correlation. We suspect this is because forgetting events happen only after an example is learned, so unlike the other proxies studied here, forgetting statistics for hard examples cannot be obtained in the earlier stages of training. Finally, we present a simple demonstration of the utility of our proxies for outlier detection. We corrupt a fraction γ = 25% of the CIFAR-10 training set with random label assignments. Then, during training, we identify the fraction γ with the lowest ranking by three C-score proxies: cumulative accuracy, p_L, and forgetting-event statistics. Figure 9c shows the removal rate, i.e., the fraction of the lowest-ranked examples that are indeed outliers; two of the C-score proxies successfully identify over 95% of the outliers.
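The cumulative learning-speed proxies can be tracked with a small accumulator. This is a hypothetical sketch, not the paper's code: it assumes the per-epoch softmax outputs for every training example are available.

```python
import numpy as np

class LearningSpeedProxy:
    """Track the cumulative learning-speed proxies: after each epoch,
    accumulate every training example's correct-class softmax confidence
    (p_L) and 0-1 accuracy, averaged from the start of training."""
    def __init__(self, num_examples):
        self.epochs = 0
        self.cum_pl = np.zeros(num_examples)
        self.cum_acc = np.zeros(num_examples)

    def update(self, probs, labels):
        # probs: (num_examples, num_classes) softmax outputs for this epoch
        self.epochs += 1
        self.cum_pl += probs[np.arange(len(labels)), labels]
        self.cum_acc += (probs.argmax(axis=1) == labels)

    def scores(self):
        """Return (cumulative p_L, cumulative accuracy) proxy scores."""
        return self.cum_pl / self.epochs, self.cum_acc / self.epochs
```

For outlier filtering in the style of the experiment above, one would rank training examples by one of these proxies and flag the lowest fraction γ as outlier candidates.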

7. DISCUSSION

We formulated a consistency profile for individual examples in a data set that reflects the probability of correct generalization to the example as a function of training set size. This profile has strong ties to generalization theory and matches basic intuitions about data regularity in both human and machine learning. We distilled the profile into a scalar C-score, which provides a total ordering of the instances in a data set by essentially the sample complexity (the amount of training data required) to ensure correct generalization to the instance. To leverage the C-score to analyze structural regularities in complex data sets, we derived a C-score estimation procedure and obtained C-scores for examples in MNIST, CIFAR-10, CIFAR-100, and ImageNet. The C-score estimate helps to characterize the continuum between densely populated modes consisting of aligned, centrally cropped examples with unified shape and color profiles, and sparsely populated modes of just one or two instances. We also used the C-score as a tool to compare the learning dynamics of different optimizers to better understand the consequences for generalization. To make the C-score useful in practice for new data sets, we explored a number of efficient proxies for the C-score based on learning speed, and found that the integrated accuracy over training is a strong indicator of an example's ranking by C-score, which reflects the generalization accuracy had the example been held out from the training set. In the 1980s, neural nets were touted for learning rule-governed behavior without explicit rules (Rumelhart & McClelland, 1986). At the time, AI researchers were focused on constructing expert systems by extracting explicit rules from human domain experts. Expert systems ultimately failed because the diversity and nuance of statistical regularities in a domain were too great for any human to explicate.
In the modern deep learning era, researchers have made much progress in automatically extracting regularities from data. Nonetheless, there is still much work to be done to understand these regularities, and how the consistency relationships among instances determine the outcome of learning. By defining and investigating a consistency score, we hope to have made some progress in this direction. 

A EXPERIMENT DETAILS

The details on model architectures, data set information, and hyper-parameters used in the experiments for empirical estimation of the C-score can be found in Table 1. We implement our experiments in TensorFlow (Abadi et al., 2015). The holdout subroutine used in C-score estimation is listed in Algorithm 1. Most of the training jobs for C-score estimation are run on single NVidia Tesla P100 GPUs. The ImageNet training jobs are run with 8 P100 GPUs using single-node multi-GPU data parallelization. The experiments on learning speed are conducted with ResNet-18 on CIFAR-10, trained for 200 epochs with batch size 32. We use SGD with an initial learning rate of 0.1, Nesterov momentum 0.9, and weight decay 5e-4. The stage-wise constant learning rate scheduler decreases the learning rate at the 60th, 90th, and 120th epochs with a decay factor of 0.2.

Algorithm 1: k-fold holdout estimation of the empirical consistency profile at subset size n.
Input: data X, Y of size N; number of holdout runs k.
Output: Ĉ ∈ R^N with entries Ĉ_{D,n}(x, y) for (x, y) ∈ D.
  Initialize binary mask matrix M ← 0_{k×N}
  Initialize 0-1 loss matrix L ← 0_{k×N}
  for i ∈ (1, 2, . . . , k) do
    Sample n random indices I from {1, . . . , N}
    M[i, I] ← 1
    Train f from scratch on the subset X[I], Y[I]
    L[i, :] ← 1[f(X) ≠ Y]
  end for
  Initialize score estimation vector Ĉ ← 0_N
  for j ∈ (1, 2, . . . , N) do
    Q ← ¬M[:, j]    (runs in which example j was held out)
    Ĉ[j] ← sum(¬L[Q, j]) / sum(Q)
  end for

B TIME AND SPACE COMPLEXITY

The time complexity of the holdout procedure for empirical estimation of the C-score is O(S(kT + E)). Here S is the number of subset ratios, k is the number of holdout runs for each subset ratio, and T is the average training time for a neural network. E is the time for computing the score given the k-fold holdout training results, which involves elementwise computation on a matrix of size k × N and is negligible compared to the time for training neural networks. The space complexity is the space for training a single neural network times the number of parallel training jobs. The space complexity for computing the scores is O(kN). For kernel density estimation based scores, the most expensive part is forming the pairwise distance matrix (and the kernel matrix), which requires O(N²) space and O(N²d) time, where d is the dimension of the input or hidden representation space.

C MORE VISUALIZATIONS OF IMAGES RANKED BY C-SCORE

Examples with high, middle, and low C-scores from a few representative classes of MNIST, CIFAR-10, and CIFAR-100 are shown in Figure 5. In this appendix, we depict the results for all 10 classes of MNIST and CIFAR-10 in Figure 10 and Figure 11, respectively. The results for the first 60 of the 100 classes of CIFAR-100 are depicted in Figure 12. Figure 13 and Figure 14 show more examples from ImageNet. Please see (URL anonymized) for more visualizations.

D C-SCORE PROXIES BASED ON PAIRWISE DISTANCES

We study C-score proxies based on pairwise distances here. Intuitively, an example is consistent with the data distribution if it lies near other examples having the same label. However, if the example lies far from instances in the same class or lies near instances of different classes, one might not expect it to generalize.
Based on this intuition, we define a relative local-density score: Ĉ±L(x, y) = (1/N) Σ_{i=1}^{N} 2(1[y = y_i] − 1/2) K(x_i, x), where K(x, x′) = exp(−‖x − x′‖² / h²) is an RBF kernel with bandwidth h, and 1[·] is the indicator function. To evaluate the importance of explicit label information, we study two related scores: ĈL, which uses only same-class examples when estimating the local density, and Ĉ, which uses all neighbor examples and ignores the labels: ĈL(x, y) = (1/N) Σ_{i=1}^{N} 1[y = y_i] K(x_i, x), and Ĉ(x) = (1/N) Σ_{i=1}^{N} K(x_i, x). We also study a proxy based on the local outlier factor (LOF) algorithm (Breunig et al., 2000), which measures the local deviation of each point with respect to its neighbors. Since large LOF scores indicate outliers, we use the negative LOF score as a C-score proxy, denoted ĈLOF(x). Table 2 shows the agreement between the proxy scores and the estimated C-score, quantified by two rank correlation measures on three data sets. As anticipated, the input-density score that ignores labels, Ĉ(x), and the class-conditional density, ĈL(x, y), agree poorly with the C-score. Ĉ±L(x, y) and ĈLOF do slightly better. However, none of the proxies correlates strongly enough to be useful, because semantically meaningful distances are very hard to obtain from raw pixels. Since proxies based on pairwise distances in the input space work poorly, we further evaluate proxies that use the penultimate layer of the network as the representation of an image: Ĉ±L_h, ĈL_h, Ĉ_h, and ĈLOF_h, where the subscript h indicates that the score operates in the hidden space. For each score and data set, we compute Spearman's rank correlation between the proxy score and the C-score. In particular, we train neural network models with the same specification as in Table 1 on the full training set.
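As a concrete illustration, the three density-based proxies above can be written in a few lines of NumPy. This is a sketch under our own naming and toy data, not the paper's released implementation; the bandwidth is passed in explicitly here rather than chosen adaptively.

```python
import numpy as np

def rbf_kernel(X, h):
    # Pairwise RBF kernel K(x, x') = exp(-||x - x'||^2 / h^2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h ** 2)

def density_proxies(X, y, h):
    """Compute the three pairwise-distance proxy scores for every example."""
    K = rbf_kernel(X, h)
    same = (y[:, None] == y[None, :]).astype(float)
    c_pm = (2.0 * (same - 0.5) * K).mean(axis=1)  # relative local density (label-reweighted)
    c_l = (same * K).mean(axis=1)                 # same-class local density
    c_plain = K.mean(axis=1)                      # label-free input density
    return c_pm, c_l, c_plain
```

On a toy two-cluster problem, a mislabeled point sitting inside the wrong cluster receives a low Ĉ±L even though its label-free density Ĉ is high, which is exactly why the label-free proxy fails.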
We use an RBF kernel K(x, x′) = exp(−‖x − x′‖² / h²), where the bandwidth h is chosen adaptively as 1/2 of the mean pairwise Euclidean distance across the data set. For the local outlier factor (LOF) algorithm (Breunig et al., 2000), we use the neighborhood size k = 3; see Figure 16 for the behavior of LOF across a wide range of neighborhood sizes. Because the embedding changes as the network is trained, we plot the correlation as a function of training epoch in Figure 15. For both data sets, the proxy score that correlates best with the C-score is Ĉ±L_h (grey), followed by ĈLOF_h (brown), then ĈL_h (pink) and Ĉ_h (blue). Clearly, appropriate use of the labels helps with the ranking. However, our proxy Ĉ±L_h uses the labels in an ad hoc manner; we will discuss a more principled measure based on gradient vectors shortly and relate it to the neural tangent kernel (Jacot et al., 2018). The results also reveal interesting properties of the hidden representation. One might be concerned that as training progresses, the representation will optimize toward the classification loss and discard inter-class relationships that could be useful for other downstream tasks (Scott et al., 2018). However, our results suggest that Ĉ±L_h does not diminish as a predictor of the C-score, even long after training converges. Thus, at least some information concerning the relations between examples is retained in the representation, even though intra- and inter-class similarity is not directly relevant to a classification model. To the extent that the hidden representation (crafted through a discriminative loss) preserves class structure, one might expect that the C-score could be predicted without label reweighting; however, the poor performance of Ĉ_h suggests otherwise. Figure 17 and Figure 18 visualize examples from CIFAR-10 and CIFAR-100 ranked by the class-weighted local density scores in the input space and the learned hidden space, respectively.
The ranking computed in the input space relies heavily on low-level features that can be read directly off the pixels, such as a strong silhouette. The rankings computed in the learned hidden space correlate better with the C-score, though the visualizations show that they are sometimes still noisy, even among the top-ranked examples (e.g., the class "automobile" in CIFAR-10).
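For reference, the LOF computation behind ĈLOF can be sketched in NumPy. This is our own simplified version of Breunig et al. (2000) (no handling of distance ties), not the implementation used for the experiments.

```python
import numpy as np

def negative_lof(X, k=3):
    """Negative local outlier factor: higher values indicate more regular points."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)              # exclude each point from its own neighbors
    idx = np.argsort(D, axis=1)[:, :k]       # indices of the k nearest neighbors
    dk = np.take_along_axis(D, idx, axis=1)  # distances to those neighbors
    kdist = dk[:, -1]                        # k-distance of each point
    reach = np.maximum(kdist[idx], dk)       # reachability distances to neighbors
    lrd = 1.0 / reach.mean(axis=1)           # local reachability density
    lof = lrd[idx].mean(axis=1) / lrd        # LOF: neighbors' density vs. own density
    return -lof                              # negate so outliers score low, like the C-score
```

A point far from a tight cluster has a much larger LOF than the cluster members (whose LOF is near 1), so its negated score is the lowest, matching its use as a C-score proxy.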

D.1 PAIRWISE DISTANCE ESTIMATION WITH GRADIENT REPRESENTATIONS

Most modern neural networks are trained with first-order gradient-based algorithms and their variants. In each iteration, the gradient of the loss on a mini-batch of training examples, evaluated at the current network weights, is computed and used to update the parameters. Let ∇_t(·) be the function that maps an input-label training pair (the case of mini-batch size one) to the corresponding gradient evaluated at the network weights of the t-th iteration. This defines a gradient-based representation on which we can compute density-based ranking scores. The intuition is that, under a gradient-based learning algorithm, an example is consistent with others if they all produce similar gradients. Compared to hidden representations defined by the outputs of a neural network layer, gradient-based representations incorporate the label information more naturally. In the previous section, we reweighted neighbor examples belonging to a different class by 0 or −1. For gradient-based representations, no ad hoc reweighting is needed: the gradient is computed on a loss that already takes the label into account, so similar inputs with different labels automatically lead to dissimilar gradients. Moreover, this approach seamlessly handles labels and losses with rich structure (e.g., image segmentation, machine translation), where an effective reweighting scheme is hard to find. The gradient-based representation is closely related to recent developments on the Neural Tangent Kernel (NTK) (Jacot et al., 2018). It has been shown that as the network width goes to infinity, the neural network training dynamics can be effectively approximated via a Taylor expansion at the initial network weights. In other words, the algorithm effectively learns a linear model on the nonlinear representation defined by ∇_0(·). This feature map induces the NTK and connects deep learning to the literature on kernel machines.
Although the NTK enjoys nice theoretical properties, performing density estimation on it is challenging. Even in the more practical case of finite-width networks, the gradient representations are of extremely high dimension, as modern neural networks generally have parameters ranging from millions to billions (e.g., Tan & Le, 2019; Radford et al., 2019). As a result, both the computation and memory requirements are prohibitive if naive density estimation is performed on the gradient representations. We leave the exploration of efficient algorithms for practically computing this score to future work.
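To make the intuition concrete, here is a toy sketch (ours, not the paper's implementation) using per-example gradients of a logistic-regression loss as the representation, with similarity measured by the cosine between gradients. An identical input with a flipped label yields an exactly opposite gradient, illustrating how the label enters the representation automatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_grad(w, x, y):
    """Gradient of the logistic loss -log sigmoid(y * w.x) w.r.t. w, with y in {-1, +1}."""
    return -y * sigmoid(-y * np.dot(w, x)) * x

def grad_cosine(w, x1, y1, x2, y2):
    """Cosine similarity between the gradient representations of two examples."""
    g1, g2 = example_grad(w, x1, y1), example_grad(w, x2, y2)
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))
```

For a deep network one would replace `example_grad` with the full parameter gradient ∇_t(x, y); the prohibitive cost of doing so at scale is exactly the obstacle discussed above.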

E WHAT MAKES AN ITEM REGULAR OR IRREGULAR?

The notion of regularity comes primarily from the statistical consistency of an example with the rest of the population, and less from the intrinsic structure of the example's contents. To illustrate this, we refer back to the experiments in Section 5 on measuring the learning speed of groups of examples generated by equally partitioning the C-score value range [0, 1]. As shown in Figure 4b, the distribution is uneven between high and low C-score values; as a result, the high C-score groups contain more examples than the low C-score groups. This agrees with the intuition that regularity arises from high probability mass. To test whether an example with a top-ranking C-score remains highly regular after the density of its neighborhood is reduced, we redo the experiment, but subsample each group to contain an equal number (~400) of examples. We then train on this new data set and observe the learning speed of each (subsampled) group. The result is shown in Figure 19, which is to be compared with the results without group-size equalization in Figure 8a in the main text. The following observations can be made: 1. The learning curves for many of the groups start to overlap with each other.
Note that C 1k-2k itself is well correlated with the full estimate from 2,000 models, as demonstrated by the following correlations: ρ(C 0-1k, C 1k-2k) = 0.9996, ρ(C 0-1k, C 0-2k) = 0.9999, and ρ(C 1k-2k, C 0-2k) = 0.9999.
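These comparisons all use Spearman's ρ. For reference, a minimal tie-free implementation is just Pearson correlation on rank-transformed scores (this sketch is ours; `scipy.stats.spearmanr` provides a full version with tie handling):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for tie-free data: Pearson correlation on ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)  # rank of each element of a
    rb = np.argsort(np.argsort(b)).astype(float)  # rank of each element of b
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

Because the statistic depends only on ranks, any monotone rescaling of one score estimate leaves ρ unchanged, which is why it is a natural choice for comparing C-score estimates built from different numbers of models.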

G CODE AND PRE-COMPUTED C-SCORES

We provide code implementing our C-score estimation algorithms, along with pre-computed C-scores and associated model checkpoints for CIFAR-10, CIFAR-100, and ImageNet, at (URL anonymized). The exported files are in NumPy's data format, saved via numpy.savez. For CIFAR-10 and CIFAR-100, the exported file contains two arrays, labels and scores. Both arrays are stored in the order of training examples as defined by the original data sets found at https://www.cs.toronto.edu/~kriz/cifar.html. The data loading tools provided by some deep learning libraries may not follow the original example ordering, so we provide the labels array for an easy sanity check of the data ordering. For ImageNet, since there is no well-defined example ordering, we order the exported scores arbitrarily and include a script that reconstructs the data set with index information, using the filename of each example to identify the example-score mapping.
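Loading the exported files is straightforward; the sketch below (the file name is hypothetical) also performs the ordering sanity check suggested above by comparing against reference labels in the original data set order.

```python
import numpy as np

def load_cscores(path, reference_labels=None):
    """Load an exported .npz file containing 'labels' and 'scores' arrays.

    If reference_labels is given (labels in the original data set order),
    verify that the data loader's example ordering matches the export.
    """
    data = np.load(path)
    labels, scores = data["labels"], data["scores"]
    if reference_labels is not None and not np.array_equal(
            labels, np.asarray(reference_labels)):
        raise ValueError("example ordering does not match the original data set")
    return labels, scores
```

Usage would look like `labels, scores = load_cscores("cifar10-cscores.npz", my_labels)`, where `my_labels` comes from iterating the training set in its original order.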




Figure 1: (a) Regularities and exceptions in a binary chairs vs. non-chairs problem. (b) Illustration of consistency profiles. (c) Regularities (high C-scores) and exceptions (low C-scores) in ImageNet.

Figure 3: (a) Top ranked examples in CIFAR-10 and CIFAR-100. (b) Bottom ranked examples with annotations.

Figure 4: (a) Histogram of Ĉ D,n for each subset ratio on CIFAR-10. (b) Histogram of the C-score Ĉ D averaged over all subset ratios on 3 different data sets.

Figure 5: Examples from MNIST (blocks 1, 2), CIFAR-10 (blocks 3-6), and CIFAR-100 (blocks 7-10). Each block shows a single class; the left, middle, and right columns of a block depict instances with high, intermediate, and low C-scores, respectively.

Figure 5 presents instances that vary in C-score. Each block of examples is one category; the left, middle, and right columns have high, intermediate, and low C-scores, respectively. The homogeneity of the examples in the left column suggests that dense modes generally consist of centrally cropped images of well-aligned instances in their typical poses, with highly uniform color schemes. In contrast, many of the examples in the right column take atypical forms or are even ambiguous.

Figure 6: (a) Rank correlation between the integral C-score and the C-score for a particular subset ratio, s. The peak of each curve indicates the training set size that best reveals generalization of the model. (b) Joint distribution of C-score per-class means and standard deviations on ImageNet. Samples from representative classes are shown in Figure 7.

Figure 7: Example images from ImageNet. The 5 classes are chosen to have representative per-class C-score mean-standard-deviation profiles, as shown in Figure 6b. For each class, the three columns show sampled images from the (C-score ranked) top 99%, 35%, and 1% percentiles, respectively. The bottom panel shows the histograms of the C-scores in each of the 5 classes.

Figure 9: (a) Learning speed of CIFAR-10 examples grouped by C-score with SGD using constant learning rate. The 4 learning rates correspond to the constants in the stage-wise scheduler in Figure 8a. The test accuracies for those models are 84.84%, 91.19%, 92.05%, and 90.82%, respectively. (b) Rank correlation (Spearman's ρ) between C-score and training statistics based proxies. (c) Using C-score proxies to identify outliers on CIFAR-10.

Estimation of Ĉ_D,n
Input: Data set D = (X, Y) with N examples
Input: n: number of instances used for training
Input: k: number of subset samples
Output: estimated scores Ĉ_D,n(x, y) for all examples in D
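The estimation procedure above can be sketched as follows. The nearest-centroid stand-in for the trained model is our simplification for illustration (the paper trains a neural network per subset), and the function names are ours.

```python
import numpy as np

def nearest_centroid(X_tr, y_tr):
    """Toy stand-in for 'train a model on (X_tr, y_tr)': returns a predict(X) callable."""
    classes = np.unique(y_tr)
    cents = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    def predict(X):
        d2 = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        return classes[d2.argmin(axis=1)]
    return predict

def estimate_cscore(X, y, n, k, train_fn, seed=0):
    """Holdout estimate of C^_{D,n}: average held-out accuracy per example,
    over k models each trained on a random subset of size n."""
    rng = np.random.default_rng(seed)
    N = len(X)
    correct = np.zeros(N)
    counts = np.zeros(N)
    for _ in range(k):
        train_idx = rng.choice(N, size=n, replace=False)
        held = np.setdiff1d(np.arange(N), train_idx)
        predict = train_fn(X[train_idx], y[train_idx])
        correct[held] += (predict(X[held]) == y[held])
        counts[held] += 1
    return correct / np.maximum(counts, 1)  # expected accuracy when held out
```

On a toy two-cluster data set, a mislabeled example is essentially never predicted correctly when held out, so its estimated score collapses toward 0 while consistent examples score near 1.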

Stage3 :: Block(176, 160) → Block(176, 160). Block(C1, C2) :: Concat(Conv(1×1, C1), Conv(3×3, C2)). Conv :: Convolution → BatchNormalization → ReLU. The ∧(15%) learning rate scheduler linearly increases the learning rate from 0 to the base learning rate over the first 15% of training steps, and then linearly decreases it to 0 over the remaining steps. The LinearRampupPiecewiseConstant scheduler linearly increases the learning rate from 0 to the base learning rate over the first 15% of training steps; the learning rate then remains piecewise constant, with a 10× decay at 30%, 60%, and 90% of the training steps, respectively. Random Padded Cropping pads 4 pixels of zeros on all four sides of MNIST, CIFAR-10, and CIFAR-100 images and randomly crops back to the original image size. For ImageNet, a padding of 32 pixels is used on all four sides of the images.
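The two schedulers described above can be written down directly as functions of the training-step fraction; the sketch below uses our own naming and takes the step and total step count explicitly.

```python
def lr_triangular(step, total_steps, base_lr, rampup_frac=0.15):
    """The '∧(15%)' scheduler: linear ramp from 0 to base_lr over the first 15%
    of steps, then linear decay back to 0 over the remaining steps."""
    frac = step / total_steps
    if frac < rampup_frac:
        return base_lr * frac / rampup_frac
    return base_lr * (1.0 - frac) / (1.0 - rampup_frac)

def lr_rampup_piecewise(step, total_steps, base_lr, rampup_frac=0.15,
                        decay_points=(0.30, 0.60, 0.90), decay=0.1):
    """LinearRampupPiecewiseConstant: linear ramp over the first 15% of steps,
    then piecewise constant with a 10x decay at 30%, 60%, and 90%."""
    frac = step / total_steps
    if frac < rampup_frac:
        return base_lr * frac / rampup_frac
    lr = base_lr
    for p in decay_points:
        if frac >= p:
            lr *= decay
    return lr
```

Either function can be called once per optimizer step to set the current learning rate.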

Details for the experiments used in the empirical estimation of the C-score.

Figure 10: Examples from MNIST. Each block shows a single class; the left, middle, and right columns of a block depict instances with high, intermediate, and low C-scores, respectively.

Figure 11: Examples from CIFAR-10. Each block shows a single class; the left, middle, and right columns of a block depict instances with high, intermediate, and low C-scores, respectively.

Figure 12: Examples from CIFAR-100. Each block shows a single class; the left, middle, and right columns of a block depict instances with high, intermediate, and low C-scores, respectively. The first 60 (out of the 100) classes are shown.


Figure 15: Spearman rank correlation between C-score and distance-based score on hidden representations.

Figure 16: The Spearman's ρ correlation between the C-score and the score based on LOF with different neighborhood sizes.

Figure 17: Examples from CIFAR-10 (left 5 blocks) and CIFAR-100 (right 5 blocks). Each block shows a single class; the left, middle, and right columns of a block depict instances with top, intermediate, and bottom ranking according to the relative local density score Ĉ±L in the input space, respectively.

Figure 18: Examples from CIFAR-10 (left 5 blocks) and CIFAR-100 (right 5 blocks). Each block shows a single class; the left, middle, and right columns of a block depict instances with top, intermediate, and bottom ranking according to the relative local density score Ĉ±L h in the latent representation space of a trained network, respectively.

Figure 20: The correlation between C-scores estimated with varying numbers of models (the x-axis) and C-scores estimated with 1,000 independent models. The simulations are run on CIFAR-10, and the error bars show the standard deviation over 10 runs.

The learning curves have wider dispersion under SGD than under Adam. Early in SGD training, when the learning rate is large, the examples with the lowest C-scores are barely learned. Figure 9a shows SGD with constant learning rates corresponding to the values used in each stage of the scheduler in Figure 8a, confirming that with a large learning rate, the groups with the lowest C-scores are not learned well even after the full 200 epochs of training. In comparison, Adam shows less spread among the groups and, as a result, converges sooner. However, the superior convergence speed of adaptive optimizers like Adam does not always lead to better generalization (Wilson et al., 2017; Keskar & Socher, 2017; Luo et al., 2019). We observe this outcome as well: SGD with a stage-wise learning rate achieves 95.14% test accuracy, compared to 92.97% for Adam.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148-4158, 2017.

Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817-8826, 2018.

A simplified Inception model suitable for small image sizes, defined as follows: Inception :: Conv(3×3, 96) → Stage1 → Stage2 → Stage3 → GlobalMaxPool → Linear. Stage1 :: Block(32, 32) → Block(32, 48) → Conv(3×3, 160, Stride=2). Stage2 :: Block(112, 48) → Block(96, 64) → Block(80, 80) → Block(48, 96) → Conv(3×3, 240, Stride=2).

Table 2: Rank correlation between the C-score and pairwise-distance-based proxies on the inputs, measured with Spearman's ρ and Kendall's τ rank correlations, respectively.


As a result, the within-group diversities of the highest-ranked groups are still much smaller than those of the lowest-ranked groups. In summary, the regularity of an example arises from its consistency relation with the rest of the population; a regular example in isolation is no different from an outlier. Moreover, regularity is not merely an intrinsic property of the data distribution, but is closely related to the model, the loss function, and the learning algorithm. For example, while a picture with a red lake and a purple forest would likely be considered an outlier in the usual sense, for a model that uses only grayscale information it could be highly regular.

F SENSITIVITY OF C-SCORES TO THE NUMBER OF MODELS

We used 2,000 models per subset ratio to evaluate the C-scores in our experiments, to ensure stable estimates. In this section, we study the sensitivity of the C-scores to the number of models and evaluate the possibility of using fewer models in practice. Let C 0-2k be the C-scores estimated with the full 2,000 models per subset ratio. We split the 2,000 models for each subset ratio into two halves and obtain two independent estimates, C 0-1k and C 1k-2k. Then, for m ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000}, we sample m random models from the first 1,000-model split and estimate C-scores (denoted C m) based on those models. We compute the Spearman's ρ correlation between each C m and C 1k-2k. The results are plotted in Figure 20. The random sampling of m models is repeated 10 times for each m, and the error bars show the standard deviations. The figure shows that a good correlation is found for as few as m = 64 models. However, the integral C-score requires training models for various subset ratios (9 different subset ratios in our simulations), so the total number of models needed is roughly 64 × 9. If we want a reliable estimate of the C-score under a single fixed subset ratio, we find that we need 512 models in order to get a > .95 correlation with C 1k-2k. So it appears that whether we compute the integral C-score or the C-score for a particular subset ratio, we need to train on the order of 500-600 models. In the analysis above, we used C 1k-2k as the reference score for computing correlations, to ensure no overlap between the models used to compute the different estimates.

