WHEN DO MODELS GENERALIZE? A PERSPECTIVE FROM DATA-ALGORITHM COMPATIBILITY

Anonymous authors
Paper under double-blind review

Abstract

One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression (Nagarajan & Kolter, 2019). In many scenarios, their failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. To address this issue, we propose a concept named compatibility, which quantitatively characterizes generalization in a manner that is both data-relevant and algorithm-relevant. By considering the entire training trajectory and focusing on early-stopping iterates, compatibility exploits both data and algorithm information, and is therefore a suitable notion for the generalization of overparameterized models. We validate this claim by theoretically studying compatibility in the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in this setting. Our theoretical results demonstrate that, in the sense of compatibility, generalization holds under significantly weaker restrictions on the problem instance than in previous at-convergence analyses.

1. INTRODUCTION

Although deep neural networks achieve great success in practice (Silver et al., 2017; Devlin et al., 2019; Brown et al., 2020), their remarkable generalization ability remains one of the essential mysteries in the deep learning community. One of the most intriguing features of deep neural networks is overparameterization, which confers a level of tractability on the training problem but leaves traditional generalization theories failing to work. In generalization analysis, both the training algorithm and the data distribution play essential roles (Jiang et al., 2020). For instance, one line of work (Zhang et al., 2021; Nagarajan & Kolter, 2019) highlights the role of the algorithm by showing that algorithm-irrelevant uniform convergence bounds can become inconsistent in deep learning regimes. Another line of work on benign overfitting (Bartlett et al., 2019; Tsigler & Bartlett, 2020) emphasizes the role of the data distribution via profound analysis of specific overparameterized models. Despite the significant roles of data and algorithm in generalization analysis, existing theories usually focus on either the data factor (e.g., uniform convergence (Nagarajan & Kolter, 2019) and last-iterate analysis (Bartlett et al., 2019; Tsigler & Bartlett, 2020)) or the algorithm factor (e.g., stability-based bounds (Hardt et al., 2016)).¹ Combining both the data and the algorithm factors in generalization analysis can help derive tighter generalization bounds and explain the generalization ability of overparameterized models observed in practice. In this sense, a natural question arises:

How to incorporate both data factor and algorithm factor into generalization analysis?

To gain insight into the interplay between data and algorithms, we provide motivating examples of a synthetic overparameterized linear regression task and a classification task on the corrupted MNIST dataset in Figure 1. In both scenarios, the final iterate, which carries less algorithmic information (e.g., the algorithm type such as GD or SGD, and hyperparameters such as the learning rate and the number of epochs), generalizes much worse than the early-stopping solutions (see the blue line). In the linear regression case, the generalization error of the final iterate can be more than 100× larger than that of the early-stopping solution. In the MNIST case, the final iterate on the SGD trajectory has 19.9% test error, much higher than the 2.88% test error of the best iterate on the GD trajectory. Therefore, the almost ubiquitous strategy of early stopping is a key ingredient in generalization analysis for overparameterized models, whose benefits have been demonstrated both theoretically and empirically (Yao et al., 2007; Ali et al., 2019; Li et al., 2020b; Ji et al., 2021). By focusing on the entire optimization trajectory and performing data-dependent trajectory analysis, both the data information and the dynamics of the training algorithm can be exploited to yield consistent generalization bounds.

To analyze the data-dependent trajectory, we introduce a new concept named data-algorithm compatibility, which jointly characterizes the roles of the data and the algorithm in generalization analysis. Informally speaking, an algorithm is compatible with a data distribution if, as the sample size goes to infinity, the minimum excess risk of the iterates on the training trajectory converges to zero. The significance of compatibility is threefold. First, compatibility incorporates both data and algorithm factors into generalization analysis, and brings new messages into generalization in the overparameterized regime (see Definition 3.1). Second, compatibility serves as a minimal condition for generalization, without which one cannot expect to find a consistent solution via standard learning procedures. Consequently, compatibility holds under only mild assumptions and applies to a wide range of problem instances (see Theorem 4.1). Third, compatibility captures the algorithmic significance of early stopping in generalization. By exploiting the algorithm information along the entire trajectory, we arrive at better generalization bounds than the at-convergence analysis (see Tables 1 and 2 for examples).

To theoretically validate compatibility, we study it in the overparameterized linear regression setting. Analyzing overparameterized linear regression is a reasonable starting point for studying compatibility in more complex models such as deep neural networks, since many phenomena of high-dimensional non-linear models are also observed in the linear regime (e.g., Figure 1). Furthermore, the recent neural tangent kernel (NTK) framework demonstrates that very wide neural networks trained by gradient descent with appropriate random initialization can be approximated by kernel regression in a reproducing kernel Hilbert space, which rigorously establishes a close relationship between overparameterized linear regression and deep neural network training (Jacot et al., 2018; Arora et al., 2019).
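For concreteness, the informal description above admits a simple mathematical reading; the following is our own hedged sketch (the symbols $R$, $R^*$, $w_t$, and $T_n$ are our notation and need not match the paper's Definition 3.1):

```latex
% Hedged sketch of a compatibility condition. The algorithm produces
% iterates (w_t)_{t=0}^{T_n} from n i.i.d. samples of the distribution;
% R denotes the population risk and R^* = \inf_w R(w), so that
% R(w_t) - R^* is the excess risk of the iterate at step t.
\min_{0 \le t \le T_n} \bigl( R(w_t) - R^{*} \bigr)
  \;\xrightarrow{\;P\;}\; 0 \qquad \text{as } n \to \infty .
```

Under this reading, compatibility only requires that some iterate along the trajectory be consistent, which is strictly weaker than requiring consistency of the final, at-convergence iterate.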



¹ Classifying a technique as data-dependent does not mean that it entirely ignores the algorithm; rather, it discards some important algorithmic information, which renders it vacuous for generalization analysis. A similar argument applies to algorithm-dependent techniques.



Figure 1: (a) The training plot for linear regression with spectrum λ_i = 1/i² using GD. Note that the axes are on a log scale. (b) The training plot of a CNN on corrupted MNIST with 20% label noise using SGD. Both models successfully learn the useful features in the initial phase of training, but it takes a long time for them to fit the noise in the dataset. These observations demonstrate the power of data-dependent trajectory analysis, since the early-stopping solution on the trajectory generalizes well while the final iterate fails to generalize. See Appendix C for details.
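The setting of panel (a) is easy to reproduce in miniature. The sketch below is our own reconstruction (the sample size, dimension, noise level, learning rate, and ground-truth signal are all our assumptions, not the authors' exact experiment): gradient descent on overparameterized linear regression with covariance spectrum λ_i = 1/i², tracking the population excess risk along the trajectory. The best early-stopped iterate typically beats the final iterate, which fits the label noise.

```python
# Minimal reconstruction of a Figure 1(a)-style experiment (our assumptions,
# not the paper's exact setup): GD on overparameterized linear regression
# with eigenvalue spectrum lambda_i = 1/i^2 and noisy labels.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                            # n samples, d >> n parameters
lam = 1.0 / np.arange(1, d + 1) ** 2      # spectrum lambda_i = 1/i^2
X = rng.standard_normal((n, d)) * np.sqrt(lam)  # features with Cov = diag(lam)
w_star = np.zeros(d)
w_star[0] = 1.0                           # ground-truth signal (assumed)
y = X @ w_star + 0.5 * rng.standard_normal(n)   # labels with Gaussian noise

def excess_risk(w):
    # Population excess risk E[(x^T (w - w*))^2] = sum_i lam_i (w_i - w*_i)^2
    return float(np.sum(lam * (w - w_star) ** 2))

w = np.zeros(d)
lr, steps = 0.2, 20000
risks = []
for _ in range(steps):
    w -= lr * (X.T @ (X @ w - y)) / n     # full-batch gradient step
    risks.append(excess_risk(w))

print(f"best early-stopped excess risk: {min(risks):.4f}")
print(f"final-iterate excess risk:      {risks[-1]:.4f}")
```

The qualitative picture matches the caption: the signal direction (large eigenvalue) is learned in the first few iterations, while fitting the noise in the small-eigenvalue directions takes many more steps and drives the excess risk of the final iterate back up.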

