POST-MORTEM ON A DEEP LEARNING CONTEST: A SIMPSON'S PARADOX AND THE COMPLEMENTARY ROLES OF SCALE METRICS VERSUS SHAPE METRICS

Abstract

To better understand the good generalization performance of state-of-the-art neural network (NN) models, and in particular the success of the AlphaHat metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze a corpus of models that was made publicly available for a contest to predict the generalization accuracy of NNs. These models span a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We break AlphaHat into its two subcomponent metrics: a scale-based metric and a shape-based metric. We identify what amounts to a Simpson's paradox: "scale" metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; whereas "shape" metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models of varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary roles of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also further clarify why the AlphaHat metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.

1. INTRODUCTION

It is of increasing interest to develop metrics to measure and monitor the quality of Deep Neural Network (DNN) models, especially in production environments, where data pipelines can unexpectedly fail, training data can become corrupted, and errors can be difficult to detect. There are few good methods that can readily diagnose problems at a layer-by-layer level in an automated way. Motivated by this, recent work introduced the AlphaHat metric (α̂), showing that it can predict trends in the quality, or generalization capacity, of state-of-the-art (SOTA) DNN models without access to any training or testing data (Martin et al., 2021), outperforming other metrics from statistical learning theory (SLT) in a large meta-analysis of hundreds of SOTA models from computer vision (CV) and natural language processing (NLP). The α̂ metric is based on the recently-developed Heavy-Tailed Self-Regularization (HT-SR) theory (Martin & Mahoney, 2021; 2019; 2020), which draws on statistical mechanics and Heavy-Tailed (HT) random matrix theory. Further, since AlphaHat is a weighted average of layer metrics, understanding why it works will help practitioners diagnose potential problems layer by layer. In this paper, we evaluate the AlphaHat (α̂) metric (and its subcomponents) on a series of pretrained DNN models from a recent contest ("the Contest") to predict generalization in deep learning (Jiang et al., 2020a;b). The Contest was interested in metrics that were "causally informative of generalization," and it wanted participants to propose a "robust and general complexity measure" (Jiang et al., 2020a;b). These Contest models were smaller and narrower than those analyzed in the large-scale meta-analysis (Martin et al., 2021). However, for that narrower class of models, the Contest data was more detailed. There were models with a wider range of test accuracies, including models that generalize well, models that generalize poorly, and even models that appear to be overtrained.
The models are partitioned into sub-groups of fixed depth, within which regularization hyperparameters (and width) are varied. This more fine-grained set of pretrained models lets us evaluate the α̂ metric, and its subcomponents, across the opposing dimensions of depth and hyperparameter changes, and more finely than was done previously on SOTA models. Our analysis here provides new insights: on how theories of generalization perform on well-trained versus poorly-trained models; on how this depends in subtle ways on what can be interpreted as implicit scale versus implicit shape parameters of the models learned by these DNNs; and on how model quality metrics depend on architectural parameters versus solver parameters. Most importantly, this work helps clarify why the AlphaHat metric performs so well across so many models.

Background: Heavy-Tailed Self-Regularization (HT-SR) Theory. HT-SR theory is a phenomenology, based on Random Matrix Theory (RMT) and motivated by the statistical mechanics of learning, that explains empirical results on the spectral (eigenvalue) properties of SOTA DNNs (Martin & Mahoney, 2021). (A detailed discussion of HT-SR theory can be found in Martin & Mahoney (2021); here we can only summarize the basics.) Empirical results (Martin & Mahoney, 2021; 2019; 2020) show that, for nearly all well-trained DNN models (in CV and NLP), the layer correlation matrices X = (1/N) WᵀW are HT, in the sense of being well fit by a Power Law (PL) or truncated PL distribution (even though the individual matrices W are not HT elementwise). Moreover, HT-SR theory indicates that as training proceeds and/or regularization is increased, the HTness of the correlations (however it is measured) generally increases (Martin & Mahoney, 2021).
Using these facts, HT-SR theory allows one to construct various generalization-capacity metrics for DNNs (implemented in the WeightWatcher open-source tool (wei, 2018)) that measure the model-average HTness (Alpha, AlphaHat, LogSpectralNorm, etc.) as a proxy for generalization capacity. For moderately and very HT weight matrices, it is possible to quantify the HTness using a standard PL fit of the ESD; in this case, smaller PL exponents (α) correspond to heavier tails.

Given a DNN weight matrix W (N × M, N ≥ M), let λ be an eigenvalue of the correlation matrix X = (1/N) WᵀW. The Empirical Spectral Density (ESD), ρ(λ), is just the empirical distribution (histogram) of the M eigenvalues. Looking at hundreds of models and thousands of weight matrices, the tail of the ESD of a well-trained DNN can nearly always be well fit by a PL distribution:

    ρ_tail(λ) ∼ λ^(−α),   x_min ≤ λ ≤ x_max.   (1)

Here, x_max = λ_max is the maximum eigenvalue of the ESD, and x_min is the start of the tail, fit using the procedure of Clauset et al. (2009). The quality of the fit is given by the KS-distance D_KS (denoted QualityOfAlphaFit below). For models that generalize well, the fitted α ≈ 2.0 for (nearly) every layer (Martin & Mahoney, 2021; 2019; 2020). Models that generalize better generally have a smaller (weighted) average α, AlphaHat (α̂), when compared within an architecture series (VGG, ResNet, ...) or across different-size data sets (GPT vs GPT2) (Martin et al., 2021).

Still, one may ask: "Why are the layer correlation matrices X HT, but the weight matrices W themselves are not?" For the ESD of X to be HT, either W must be HT elementwise, having many spuriously large elements W_ij, and/or X must simply have many large eigenvalues. It is well known that well-regularized models should not contain spuriously large elements W_ij, and methods like L2-regularization attempt to ensure this.
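The ESD and tail fit of Eq. (1) can be sketched in a few lines of Python. This is a hypothetical illustration, not the WeightWatcher implementation: it uses the simple continuous-PL maximum-likelihood (Hill-type) estimator with a fixed x_min, whereas the papers fit x_min via the procedure of Clauset et al. (2009); the function names `esd` and `fit_pl_tail` are our own.

```python
import numpy as np

def esd(W):
    """Eigenvalues of the layer correlation matrix X = (1/N) W^T W."""
    N, M = W.shape  # convention: N >= M
    X = (W.T @ W) / N
    return np.linalg.eigvalsh(X)  # the M eigenvalues, ascending

def fit_pl_tail(eigs, x_min):
    """Continuous power-law MLE for the tail exponent alpha in
    rho_tail(lambda) ~ lambda^(-alpha), for lambda >= x_min."""
    tail = eigs[eigs >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

# Sanity check on synthetic heavy-tailed "eigenvalues": a Pareto sample
# with true tail exponent alpha = 2.5 and x_min = 1 (made-up data, not
# eigenvalues from any real model).
rng = np.random.default_rng(0)
lams = 1.0 + rng.pareto(1.5, size=200_000)  # density ~ lambda^(-2.5)
alpha = fit_pl_tail(lams, x_min=1.0)
print(round(alpha, 2))  # close to 2.5
```

In the heavier-tailed direction, the fitted exponent shrinks: this is the sense in which smaller α means a heavier tail.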
In contrast, the largest eigenvalues of X correspond to eigenvectors carrying the most important non-random information. Consequently, λ_max will become larger, not smaller, as more information is learned. (This is well known from RMT. It is also known in machine learning, where it is the basis for methods like Latent Semantic Analysis (LSA).) HT-SR theory exploits these facts to build a theory of generalization for DNNs.

Using the layer PL fits, one can define the HT-based AlphaHat metric for an entire DNN model. This metric can predict trends in the quality of SOTA DNNs, even without access to training or testing data (Martin et al., 2021). AlphaHat (α̂) is a weighted average over the L layers of two complementary metrics, the PL exponent α and the logarithm of the maximum eigenvalue, log λ_max:

    α̂ = Σ_{l=1}^{L} α_l log λ_max,l.   (2)

AlphaHat is thus both a weighted-average Alpha, weighted by the Scale of the layer ESD (log λ_max,l), and a weighted-average LogSpectralNorm, weighted by the Shape of the layer ESD (α). We therefore evaluate how these two subcomponents, the Alpha and LogSpectralNorm metrics, individually perform when varying the opposing dimensions of depth (number of layers L) and regularization hyperparameters (θ), such as dropout, momentum, and weight decay. In investigating "why AlphaHat works," we discovered that the Alpha and LogSpectralNorm metrics frequently display a Simpson's paradox. Being aware of challenges with designing good
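As a concrete illustration of Eq. (2), combining the per-layer fits into the model-level α̂ is just a sum over layers of the PL exponent times the log of the maximum eigenvalue. The sketch below assumes the per-layer pairs have already been obtained; the three layer values are made-up numbers, not measurements from any real model.

```python
import math

def alpha_hat(layer_fits):
    """AlphaHat, Eq. (2): sum over layers of alpha_l * log(lambda_max_l).

    layer_fits: list of (alpha_l, lambda_max_l) pairs, one per layer.
    """
    return sum(a * math.log(lmax) for a, lmax in layer_fits)

# Hypothetical 3-layer model: (PL exponent, max eigenvalue) per layer.
fits = [(2.2, 12.0), (3.0, 8.5), (2.6, 20.0)]
print(alpha_hat(fits))
```

Viewed this way, holding the layer shapes (α_l) fixed and growing the scales (λ_max,l) increases α̂, and vice versa, which is why the text treats α̂ as simultaneously a scale-weighted Alpha and a shape-weighted LogSpectralNorm.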

