CHARACTERIZING STRUCTURAL REGULARITIES OF LABELED DATA IN OVERPARAMETERIZED MODELS

Abstract

Humans are accustomed to environments that contain both regularities and exceptions. For example, at most gas stations, one pays prior to pumping, but the occasional rural station does not accept payment in advance. Likewise, deep neural networks can generalize across instances that share common patterns or structures, yet have the capacity to memorize rare or irregular forms. We analyze how individual instances are treated by a model via a consistency score. The score characterizes the expected accuracy for a held-out instance given training sets of varying size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and strongly regular examples at the other. We identify computationally inexpensive proxies to the consistency score using statistics collected during training. We apply the score toward understanding the dynamics of representation learning and to filter outliers during training.

1. INTRODUCTION

Human learning requires both inferring regular patterns that generalize across many distinct examples and memorizing irregular examples. The boundary between regular and irregular examples can be fuzzy. For example, in learning the past-tense forms of English verbs, there are some verbs whose past tenses must simply be memorized (GO→WENT, EAT→ATE, HIT→HIT), and there are many regular verbs that obey the rule of appending "ed" (KISS→KISSED, KICK→KICKED, BREW→BREWED, etc.). Generalization to a novel word typically follows the "ed" rule, for example, BINK→BINKED. Intermediate between the exception verbs and regular verbs are subregularities: sets of exception verbs that share a consistent structure (e.g., the mappings SING→SANG and RING→RANG). Note that rule-governed and exception cases can have very similar forms, which increases the difficulty of learning each. Consider one-syllable verbs containing 'ee', which include regular cases like NEED→NEEDED as well as exception cases like SEEK→SOUGHT. Generalization from the rule-governed cases can hamper the learning of the exception cases, and vice versa. For instance, children in an environment where English is spoken over-regularize by mapping GO→GOED early in the course of language learning. Neural nets show the same interesting pattern for verbs over the course of training (Rumelhart & McClelland, 1986).

Memorizing irregular examples is tantamount to building a look-up table with the individual facts accessible for retrieval. Generalization requires the inference of statistical regularities in the training environment and the application of procedures or rules that exploit those regularities. In deep learning, memorization is often considered a failure of a network because memorization implies no generalization. However, mastering a domain involves knowing when to generalize and when not to, because data manifolds are rarely unimodal.
Consider the two-class problem of chair vs. non-chair with training examples illustrated in Figure 1a. The iron throne (lower left) forms a sparsely populated mode (sparse mode for short), as there may not exist many similar cases in the data environment. Generic chairs (lower right) lie in a region with a consistent labeling (a densely populated mode, or dense mode) and thus seem to follow a strong regularity. But many other cases fall on the continuum between these two extremes. For example, the rocking chair (upper right) has a few supporting neighbors, but it lies in a neighborhood distinct from the majority of same-label instances (the generic chairs).
For an instance (x, y), we define the consistency profile with respect to a data distribution P and training-set size n as

    C_{P,n}(x, y) = E_{D ∼ P^n} [ P( f(x; D \ {(x, y)}) = y ) ],    n = 0, 1, . . .

where f(x; D) denotes the prediction on x of a model trained on D. This quantity measures the per-instance generalization on (x, y). For a fixed n, it characterizes how consistent (x, y) is with a sample D from P. Formally, it is closely tied to the fundamental notion of generalization performance when we take an expectation over (x, y); without the expectation, the quantity gives a fine-grained characterization of the regularity of each individual example. This article focuses on classification problems, but the definition can easily be extended to other problems by replacing the 0-1 classification loss with another suitable loss function. The quantity C_{P,n}(x, y) has an interpretation that matches our high-level intuition about the structural regularities of the training data during (human or machine) learning. In particular, we can characterize the multimodal structure of an underlying data distribution by grouping examples in terms of a model's generalization profile for those examples when trained on data sets of increasing size. For n = 0, the model makes predictions entirely based on its prior belief. As n increases, the model collects more information about P and makes better predictions.
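This expectation can be approximated by Monte Carlo: repeatedly draw a training subset of size n, train a model on it, and record whether each held-out example is classified correctly. The sketch below illustrates the idea with a toy 1-nearest-neighbor learner on synthetic two-cluster data; the `train_fn` interface and all names are illustrative assumptions, not the paper's actual methodology.

```python
import numpy as np

def estimate_consistency(train_fn, X, y, n, n_samples, rng):
    """Monte Carlo estimate of C_{P,n}(x_i, y_i) for every example i.

    Each round draws a random subset of size n, trains a model on it,
    and records correctness on the held-out examples. Averaging over
    rounds approximates the expected per-instance accuracy.
    """
    N = len(X)
    hits = np.zeros(N)
    counts = np.zeros(N)
    for _ in range(n_samples):
        idx = rng.choice(N, size=n, replace=False)
        mask = np.zeros(N, dtype=bool)
        mask[idx] = True
        predict = train_fn(X[mask], y[mask])
        held = ~mask
        hits[held] += (predict(X[held]) == y[held])
        counts[held] += 1
    return hits / np.maximum(counts, 1)

def train_1nn(X_train, y_train):
    """A stand-in learner: 1-nearest-neighbor classification."""
    def predict(X_query):
        d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        return y_train[d2.argmin(axis=1)]
    return predict

# Synthetic data: two dense clusters plus one mislabeled outlier
# sitting inside the class-0 cluster but labeled 1.
rng = np.random.default_rng(0)
X_dense0 = rng.normal(0.0, 0.3, size=(40, 2))
X_dense1 = rng.normal(5.0, 0.3, size=(40, 2))
X_outlier = np.array([[0.1, 0.1]])
X = np.vstack([X_dense0, X_dense1, X_outlier])
y = np.array([0] * 40 + [1] * 40 + [1])

scores = estimate_consistency(train_1nn, X, y, n=40, n_samples=30, rng=rng)
```

As expected under this setup, dense-mode examples receive high scores while the mislabeled outlier receives a score near zero.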
For an instance (x, y) belonging to a dense mode (e.g., the generic chairs in Figure 1a), the model prediction is accurate even for small n, because even small samples have many class-consistent neighbors. The blue curve in the cartoon sketch of Figure 1b illustrates this profile. For instances belonging to sparse modes (e.g., the iron throne in Figure 1a), the prediction will be inaccurate even for large n, as the red curve illustrates. Most instances fill the continuum between these two extreme cases, as illustrated by the purple curves in Figure 1b. To obtain a total ordering over all examples, we pool the consistency profile into a scalar consistency score, or C-score, by taking an expectation over n. Figure 1c shows examples from the ImageNet data set ranked by estimated C-scores, using a methodology we shortly describe. The images show that many ImageNet classes contain dense modes of center-cropped, close-up shots of representative examples; at the other end of the C-score ranking are sparse modes of highly ambiguous examples (in many cases, the object is barely visible or can only be inferred from the context of the picture). With strong ties to both theoretical notions of learning and human intuition, the consistency profile is an important tool for understanding the regularity and subregularity structure of training data sets and the learning dynamics of models trained on those data. The C-score ranking also has many potential uses, such as detecting out-of-distribution and mislabeled instances; balancing learning between dense and sparse modes to ensure fairness when learning from data with underrepresented groups; or serving as a diagnostic to determine training priority in a curriculum learning setting (Bengio et al., 2009; Saxena et al., 2019).
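The pooling step can be made concrete with a small sketch. Suppose per-example accuracy profiles have already been estimated at several training-set sizes n; averaging each row then gives the scalar C-score, and sorting by it yields the regularity ranking. The profile values below are invented for illustration only.

```python
import numpy as np

# Hypothetical accuracy profiles: rows are examples, columns are
# increasing training-set sizes n (values invented for illustration).
profiles = np.array([
    [0.90, 0.95, 1.00, 1.00],  # dense-mode example: accurate even at small n
    [0.20, 0.50, 0.80, 0.95],  # intermediate example
    [0.00, 0.05, 0.10, 0.20],  # sparse-mode exception: inaccurate at all n
])

# Pool each profile into a scalar C-score by averaging over n,
# an empirical stand-in for the expectation over n.
c_scores = profiles.mean(axis=1)

# Rank examples from most to least regular.
ranking = np.argsort(-c_scores)
```

Here the dense-mode example ranks first and the sparse-mode exception last, matching the profiles sketched in Figure 1b.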
In this article, we focus on formulating and analyzing consistency profiles, and we apply the C-score to analyze the structure of real-world image data sets and the learning dynamics of different optimizers. We also study efficient proxies and further applications to outlier detection. Our key contributions are as follows:




Figure 1: (a) Regularities and exceptions in a binary chairs vs. non-chairs problem. (b) Illustration of consistency profiles. (c) Regularities (high C-scores) and exceptions (low C-scores) in ImageNet.

