POST-MORTEM ON A DEEP LEARNING CONTEST: A SIMPSON'S PARADOX AND THE COMPLEMENTARY ROLES OF SCALE METRICS VERSUS SHAPE METRICS

Abstract

To better understand the good generalization performance of state-of-the-art neural network (NN) models, and in particular the success of the AlphaHat metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze a corpus of models that was made publicly available for a contest to predict the generalization accuracy of NNs. These models span a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We decompose AlphaHat into its two subcomponent metrics: a scale-based metric and a shape-based metric. We identify what amounts to a Simpson's paradox: "scale" metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; whereas "shape" metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary roles of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also further clarify why the AlphaHat metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.

1. INTRODUCTION

There is increasing interest in developing metrics to measure and monitor the quality of Deep Neural Network (DNN) models, especially in production environments, where data pipelines can unexpectedly fail, training data can become corrupted, and errors can be difficult to detect. Few good methods can readily diagnose such problems at a layer-by-layer level in an automated way. Motivated by this, recent work introduced the AlphaHat metric (α), showing that it can predict trends in the quality, or generalization capacity, of state-of-the-art (SOTA) DNN models without access to any training or testing data (Martin et al., 2021), outperforming other metrics from statistical learning theory (SLT) in a large meta-analysis of hundreds of SOTA models from computer vision (CV) and natural language processing (NLP). The α metric is based on the recently developed Heavy-Tailed Self-Regularization (HT-SR) theory (Martin & Mahoney, 2021; 2019; 2020), which draws on statistical mechanics and Heavy-Tailed (HT) random matrix theory. Further, since α is a weighted average of layer metrics, understanding why AlphaHat works will help practitioners diagnose potential problems layer by layer. In this paper, we evaluate the AlphaHat (α) metric (and its subcomponents) on a series of pretrained DNN models from a recent contest ("the Contest") to predict generalization in deep learning (Jiang et al., 2020a;b). The Contest sought metrics that were "causally informative of generalization," and it asked participants to propose a "robust and general complexity measure" (Jiang et al., 2020a;b). The Contest models were smaller and narrower than those analyzed in the large-scale meta-analysis (Martin et al., 2021). However, for that narrower class of models, the Contest data were more detailed: there were models with a wider range of test accuracies, including models that generalize well, models that generalize poorly, and even models which appear to be overtrained.
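To make the layer-metric idea concrete, the following is a minimal sketch (not the authors' released implementation): the per-layer exponent α can be estimated by fitting a power-law tail to the eigenvalue spectrum of W^T W, and a combined AlphaHat-style score then weights each layer's α by the log of its spectral norm. The Hill estimator, the `tail_fraction` parameter, and the exact normalization below are simplifying assumptions for illustration; the published metric uses a more careful power-law fit.

```python
import numpy as np

def layer_alpha(W, tail_fraction=0.5):
    """Estimate a power-law tail exponent for the eigenvalues of W^T W.

    Uses the Hill estimator on the top `tail_fraction` of eigenvalues as a
    simple stand-in for the maximum-likelihood power-law fit used in HT-SR
    theory (an illustrative assumption, not the published procedure).
    """
    evals = np.linalg.eigvalsh(W.T @ W)          # empirical spectral density (ESD)
    evals = np.sort(evals)[::-1]                 # sort descending
    k = max(2, int(len(evals) * tail_fraction))  # size of the fitted tail
    tail = evals[:k]
    x_min = tail[-1]                             # smallest eigenvalue in the tail
    # Hill estimator: alpha = 1 + k / sum(log(lambda_i / x_min))
    return 1.0 + k / np.sum(np.log(tail / x_min))

def alpha_hat(weight_matrices):
    """Combine per-layer alphas, weighting each by its log spectral norm."""
    weighted, weights = 0.0, 0.0
    for W in weight_matrices:
        log_lambda_max = np.log(np.linalg.eigvalsh(W.T @ W).max())
        weighted += layer_alpha(W) * log_lambda_max
        weights += log_lambda_max
    return weighted / weights                    # weighted average over layers

# Toy usage: three random Gaussian "layers" (no real trained model needed).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)) for _ in range(3)]
print(alpha_hat(layers))
```

Random Gaussian weights have a light-tailed (Marchenko-Pastur-like) spectrum, so the fitted exponent is large; for well-trained SOTA layers, HT-SR theory reports heavy tails with α typically in a much smaller range.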
The models are partitioned into subgroups of fixed depth, within which regularization hyperparameters (and width) are varied. This finer-grained set of pretrained models lets us evaluate

