EVALUATING NATURAL LANGUAGE PROCESSING MODELS WITH GENERALIZATION METRICS THAT DO NOT NEED ACCESS TO ANY TRAINING OR TESTING DATA

Abstract

The search for effective and robust metrics has been the focus of recent theoretical and empirical work on the generalization of deep neural networks (NNs). In this paper, we discuss the performance of natural language processing (NLP) models, and we evaluate various existing and novel generalization metrics. Compared to prior studies, we (i) focus on NLP instead of computer vision (CV), (ii) focus on generalization metrics that predict test error instead of the generalization gap, (iii) focus on generalization metrics that do not need access to data, and (iv) focus on the heavy-tail (HT) phenomenon that has received comparatively less attention in the study of deep NNs. We extend recent HT-based work, which focuses on power law (PL) distributions, and we study exponential (EXP) and exponentially truncated power law (E-TPL) fits to the empirical spectral densities (ESDs) of weight matrices. Our empirical studies are carried out on (i) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size, and the optimization hyperparameters, (ii) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including BERT, GPT2, ALBERT, etc., and (iii) a total of 28 existing and novel generalization metrics. From our detailed empirical analyses, we show that shape metrics, i.e., the metrics obtained from fitting the shape of the ESDs, perform uniformly better at predicting generalization performance than the scale metrics commonly studied in the literature, as measured by the average rank correlations with the generalization performance across all of our experiments.
We also show that, among the three HT distributions considered in this paper, the E-TPL fit of ESDs performs the most robustly when models are trained in our controlled experimental settings, while the PL fit achieves the best performance on well-trained Huggingface models, and that both E-TPL and PL metrics (which are both shape metrics) outperform scale metrics.

1. INTRODUCTION

Recent years have seen a wide array of large-scale empirical studies on the various metrics used to quantify generalization (Dziugaite et al., 2020; Jiang et al., 2019; Martin & Mahoney, 2021a; Martin et al., 2021). On the one hand, theory-driven metrics have the potential to reveal more information than test error, bringing us one step closer to unpacking the black box of deep NNs (Frankle & Carbin, 2018; Nakkiran et al., 2019; Zhang et al., 2021). On the other hand, a wide variety of generalization metrics have been applied to predict the quality of pretrained models (Martin & Mahoney, 2019; Martin et al., 2021), design effective training procedures (Foret et al., 2020; Izmailov et al., 2018), improve network efficiency (Chen et al., 2020; Dong et al., 2019), quantify network robustness (Tanay & Griffin, 2016; Yang et al., 2020), improve ensemble learning techniques (Fort et al., 2019; Garipov et al., 2018), analyze and improve large-scale machine learning contests (Martin & Mahoney, 2021a), and so on. Despite these advances in the study of generalization, however, several recent papers have pointed out deficiencies of many of these "fantastic" generalization metrics. These include a lack of robustness to changes in environmental hyperparameters (Dziugaite et al., 2020; Jiang et al., 2019), such as data, network architecture, and training schemes, as well as a Simpson's paradox, in which generalization metrics predict opposite trends when applied to individual sub-parts of a collection of learning models versus the collection as a whole (Martin & Mahoney, 2021a). Another drawback is the over-reliance on experiments with CV models, which are relatively well-explored and not representative of many other application areas. Despite a few exceptions (Martin et al., 2021; Nakkiran et al., 2019; Yang et al., 2021), systematic studies of generalization in other fields, such as NLP, are largely missing.

Generalization metrics for NLP.
The objective of this paper is to provide a systematic study of generalization metrics in NLP, addressing several deficiencies in prior studies (Dziugaite et al., 2020; Jiang et al., 2019; Martin et al., 2021). Compared to CV, predicting generalization in NLP involves several important differences that require careful consideration. The training data of standard CV benchmarks can often be obtained easily, while NLP pretraining datasets are typically web-scale and challenging to access. Therefore, generalization metrics that can measure the quality of learning models without access to data are ideal for NLP. Indeed, recent work has demonstrated that access to training or testing data is not necessary for assessing model quality (Martin et al., 2021), though such metrics have yet to be evaluated at scale in the NLP domain. Furthermore, it is typically infeasible to train NLP models to interpolate the (frequently large) training set. This becomes an issue when applying most existing generalization metrics, as they often estimate the generalization gap (i.e., the difference between training and test performance) rather than the test error itself. Metrics that focus on predicting the generalization gap include most of the well-known metrics in CV, such as those based on the PAC-Bayesian framework (McAllester, 1999; Neyshabur et al., 2018) and margins (Bartlett et al., 2017; Jiang et al., 2018; Pitas et al., 2017). To illustrate the issue, consider the problem of model selection between two models (Jiang et al., 2020; Martin & Mahoney, 2021a). Suppose we are given two classification models. Even if we have (i) access to both models' training errors and (ii) a metric that is guaranteed to rank-correlate perfectly with the generalization gap, we still cannot determine which model has the smaller test error.
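To make this point concrete, here is a toy numerical sketch, with entirely hypothetical error values of our own choosing (not from the paper), of two models whose ranking by generalization gap is the opposite of their ranking by test error:

```python
# Hypothetical train/test errors for two classification models.
models = {
    "A": {"train_err": 0.00, "test_err": 0.20},  # gap = 0.20
    "B": {"train_err": 0.15, "test_err": 0.25},  # gap = 0.10
}

# Generalization gap = test error - train error.
gaps = {name: m["test_err"] - m["train_err"] for name, m in models.items()}

# A metric that tracks the gap perfectly would prefer B (smaller gap) ...
best_by_gap = min(gaps, key=gaps.get)
# ... even though A has the lower test error.
best_by_test = min(models, key=lambda n: models[n]["test_err"])

print(best_by_gap, best_by_test)  # prints: B A
```

Knowing the training errors and the gap ranking still leaves the test-error ranking undetermined unless the gap ranking happens to agree with it, which here it does not.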
This means that, if our objective is to construct a metric that correctly predicts which model has the lower test error, rank correlation with the generalization gap is not sufficient. In this paper, we aim to study how generalization metrics correlate with model quality, for which we use test error as a close approximation. As we will demonstrate (in Figure 4), rank correlation with the generalization gap does not imply rank correlation with model quality in practice, and in fact it often orders models in the opposite order of their test errors. From a practical point of view, for NLP tasks, we prefer generalization metrics that can directly predict trends in test error (or similar evaluation metrics in NLP, such as the test BLEU score (Papineni et al., 2002)) rather than trends in the generalization gap. Naturally, we cannot expect a metric to be universally correlated with test error if evaluating the metric does not require data. However, within certain classes of models (e.g., stages of training of one model, or across pretrained models), such metrics may be effective at diagnosing model quality. With these objectives in mind, among the generalization metrics in the literature, we take particular interest in those derived from heavy-tailed self-regularization (HT-SR) theory (Martin & Mahoney, 2019; 2021b), which (i) predict test error directly instead of the generalization gap and (ii) do not require access to training (or testing) data. HT-SR theory. The core principle of HT-SR theory is that HT structures arise naturally in the ESDs of the weight matrices as the result of extracting various correlations in the data during optimization (Martin & Mahoney, 2019; 2021a;b; Martin et al., 2021). Its primary practical consequence is that, by estimating the PL coefficient from the ESDs (which requires only the weights), one can predict model quality, as smaller coefficients are reported to correspond to higher test accuracy.
However, these estimators can be unstable, so one must be careful not to rely on them alone. The quality of the PL fit itself should also point to similar conclusions (Martin & Mahoney, 2021b), which serves as a sanity check. The principles of HT-SR theory extend beyond fitting the PL coefficient, however, as ESDs can take many forms. To this end, we study three different types of distributions to fit to the ESDs of weight matrices: power laws (PL) in Eqn. (1), exponentially truncated power laws (E-TPL) in Eqn. (2), and exponential laws (EXP) in Eqn. (3). These are all commonly considered families of distributions in classical studies of PLs (Clauset et al., 2009), and it is often hard in practice to predict which family fits the data best (as we show in this paper, this is especially true for deep NNs).
p(λ) ∝ λ^(-α),             λ ≥ λ_min,    (PL)      (1)
p(λ) ∝ λ^(-α) exp(-β λ),   λ ≥ λ_min,    (E-TPL)   (2)
p(λ) ∝ exp(-β λ),          λ ≥ λ_min.    (EXP)     (3)
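As an illustrative sketch of how a PL exponent can be estimated in practice, the following uses the standard continuous maximum-likelihood estimator from Clauset et al. (2009) on synthetic power-law data; the function name and the synthetic setup are our own, not taken from the paper:

```python
import numpy as np

def fit_pl_alpha(samples, x_min):
    """Continuous MLE for the power-law exponent (Clauset et al., 2009):
    alpha = 1 + n / sum(log(x_i / x_min)), over the tail x_i >= x_min."""
    tail = samples[samples >= x_min]
    return 1.0 + tail.size / np.sum(np.log(tail / x_min))

# Draw synthetic samples from p(x) ∝ x^(-alpha), x >= x_min, via inverse CDF:
# x = x_min * (1 - u)^(-1 / (alpha - 1)) for u ~ Uniform(0, 1).
rng = np.random.default_rng(0)
alpha_true, x_min = 2.5, 1.0
u = rng.uniform(size=50_000)
samples = x_min * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

alpha_hat = fit_pl_alpha(samples, x_min)
# With this many samples, alpha_hat should be close to alpha_true = 2.5.
```

In HT-SR practice the samples would instead be the eigenvalues of a layer's ESD, and choosing λ_min is itself a nontrivial part of the fitting procedure.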
The ESD of a weight matrix W refers to the empirical density of the eigenvalues of the squared weight matrix W^T W. See "Preliminary of ESDs of weight matrices" at the end of the Introduction.
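Computing the ESD described here requires only the weight matrix itself, no training or testing data. A minimal sketch (the function name and the random stand-in matrix are ours):

```python
import numpy as np

def empirical_spectral_density(W):
    """Eigenvalues of W^T W, i.e., the squared singular values of W."""
    singular_values = np.linalg.svd(W, compute_uv=False)
    return np.sort(singular_values ** 2)

# A random matrix stands in for a trained layer's weights in this sketch.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256))
eigs = empirical_spectral_density(W)
# A histogram of `eigs` is the ESD whose tail the PL/E-TPL/EXP fits target.
```

Using singular values of W rather than eigendecomposing W^T W directly is numerically more stable and yields the same nonnegative spectrum.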

