EFFECTIVE DIMENSION OF MACHINE LEARNING MODELS

Abstract

Predicting how a trained model will perform on new, unseen data, i.e., understanding its generalization power, is one of the primary goals of machine learning. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimension as a capacity measure which seems to correlate well with generalization error on standard data sets. Importantly, we prove that the local effective dimension bounds the generalization error and discuss the aptness of this capacity measure for machine learning models.

1. INTRODUCTION

The essence of successful machine learning lies in the creation of a model that is able to learn from data and apply what it has learned to new, unseen data (Goodfellow et al., 2016). The latter ability is termed the generalization performance of a machine learning model and has proven notoriously difficult to predict a priori (Zhang et al., 2021). The relevance of generalization is rather straightforward: insight into the performance capability of a model class allows more robust models to be selected for training and deployment. But how does one begin to analyze generalization without physically training models and assessing their performance on new data thereafter? This age-old question has a rich history and is largely addressed through the notion of capacity. Loosely speaking, the capacity of a model relates to its ability to express a variety of functions (Vapnik et al., 1994). The higher a model's capacity, the more functions it is able to fit. In the context of generalization, many capacity measures have been shown to mathematically bound the error a model makes when performing a task on new data, i.e., the generalization error (Vapnik & Chervonenkis, 1971; Liang et al., 2019; Bartlett et al., 2017). Naturally, finding a capacity measure that provides a tight generalization error bound, and in particular, correlates with generalization error across a wide range of experimental setups, will allow us to better understand the generalization performance of machine learning models. Interestingly, proposed capacity measures have differed quite substantially over time, with tradeoffs apparent among the current proposals (Jiang et al., 2019).
The perennial VC dimension has been famously shown to bound the generalization error, but it does not incorporate crucial attributes, such as data potentially coming from a distribution, and it ignores the learning algorithm employed, which inherently restricts the space of models within a model class that an algorithm has access to (Vapnik et al., 1994). Arguably, among the most promising contenders that attempt to incorporate these factors are norm-based capacity measures, which regularize the margin distribution of a model by a particular norm that usually depends on the model's trained parameters (Bartlett et al., 2017; Neyshabur et al., 2017b; 2015). While these measures incorporate the distribution of data as well as the learning algorithm, the drawback is that most depend on the size of the model, which does not necessarily correlate with the generalization error in certain experimental setups (Zhang et al., 2021). To this end, we present the local effective dimension, which attempts to address these issues. By capturing the redundancy of parameters in a model, the local effective dimension is modified from (Berezniuk et al., 2020; Abbas et al., 2021) to incorporate the learning algorithm employed, in addition to being scale invariant and data dependent. The key results from our study can be summarized as follows:

• We prove that the local effective dimension bounds the generalization error of a trained model with finite data (see Theorem 4.1).
• The local effective dimension largely depends on the Fisher information, which is often approximated in practice (Kunstner et al., 2019). We rigorously quantify the sensitivity of the local effective dimension when evaluated with an approximated Fisher information (see Proposition 3.2).
• Lastly, we empirically show that the local effective dimension correlates well with generalization error in various experimental setups using standard data sets. The local effective dimension is found to decrease in line with the generalization error as a network increases in size. Similarly, the measure increases in line with the generalization error when models are trained on randomized training labels.

Table 1: Overview of established capacity measures and desirable properties. The first property is whether the measure can be mathematically related to the generalization error via an upper bound. The second states whether this bound is good in practice, i.e., whether the measure correlates with the generalization error in various experimental setups, such as those of (Zhang et al., 2021). Scale invariance corresponds to the measure being insensitive to inconsequential transformations of the model, such as multiplying a neural network's weights by a constant. Data and training dependence refers to a measure accounting for data drawn from a distribution and for the learning algorithm employed. Finite data merely implies that the measure can handle finite data. Lastly, efficient evaluation refers to the possibility of estimating the capacity measure in polynomial time (in the number of data).
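To make the quantities in the points above concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes the effective dimension formula of Abbas et al. (2021) evaluated at a single trained parameter point (a simplification of the local definition, which integrates over a small neighborhood), and it uses the empirical Fisher information, i.e., the average outer product of per-sample gradients. The function names are chosen for illustration.

```python
import numpy as np

def empirical_fisher(per_sample_grads):
    # Empirical Fisher information: average outer product g g^T of the
    # per-sample log-likelihood gradients, evaluated at the trained
    # parameters. per_sample_grads has shape (n_samples, d).
    n, _ = per_sample_grads.shape
    return per_sample_grads.T @ per_sample_grads / n

def local_effective_dimension(fisher, n, gamma=1.0):
    # Single-point sketch of the effective dimension of Abbas et al.
    # (2021): log det(I + kappa * F) / log(kappa), where
    # kappa = gamma * n / (2 * pi * log n) and n is the number of data.
    kappa = gamma * n / (2 * np.pi * np.log(n))
    # Compute the log-determinant stably from the eigenvalues of F,
    # clipping tiny negative values caused by floating-point error.
    eigvals = np.clip(np.linalg.eigvalsh(fisher), 0.0, None)
    logdet = np.sum(np.log1p(kappa * eigvals))
    return logdet / np.log(kappa)
```

For a full-rank Fisher matrix the measure approaches the raw parameter count d as n grows, while rank deficiency (i.e., redundant parameters) lowers it, which is the redundancy-capturing behavior described above.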

2. PRELIMINARIES

In this section, we provide an overview of relevant literature and a concise introduction to generalization error bounds and the Fisher information.

2.1. RELATED WORK

We briefly discuss relevant capacity measures proposed in the literature, but defer to (Jiang et al., 2019) for a more comprehensive overview. Given a model class, Vapnik et al. showed that the VC dimension can provide an upper bound on the generalization error (Vapnik et al., 1994). While this was a crucial first step in using capacity to understand generalization, the VC dimension rests on unrealistic assumptions, such as access to infinite data, and ignores training dependence as well as the fact that data, more realistically, comes from a distribution (Holden & Niranjan, 1995). The closely related Rademacher complexity relaxes some of the assumptions made on the model class, but still suffers from issues similar to the VC dimension (Yin et al., 2019; Wang et al., 2018). Since then, a myriad of capacity measures aiming to circumvent these problems and provide tighter generalization error bounds have been proposed. Margin-based capacity measures stemmed from the work of Vapnik and Chervonenkis in 1974, who pointed out that generalization error bounds based on the VC dimension may be significantly enhanced in the case of linear classifiers that produce large margins. In (Bartlett et al., 1998), it was shown that the phenomenon whereby boosting models (no matter how large) do not overfit data could also be explained by the large margins these boosting models achieve. Since the

