LEARNING CURVES FOR ANALYSIS OF DEEP NETWORKS

Abstract

A learning curve models a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. We also provide several interesting observations based on learning curves for a variety of image classification models.

1. INTRODUCTION

What gets measured gets optimized. We need better measures of learning ability to design better classifiers and predict the payoff of collecting more data. Currently, classifiers are evaluated and compared by measuring performance on one or more datasets according to a fixed train/test split. Ablation studies help evaluate the impact of design decisions. However, one of the most important characteristics of a classifier, how it performs with varying numbers of training samples, is rarely measured or modeled.

In this paper, we refine the idea of learning curves that model error as a function of training set size. Learning curves were introduced nearly thirty years ago (e.g., by Cortes et al. (1993)) to accelerate model selection of neural networks. Recent works have demonstrated the predictability of performance improvements with more data (Hestness et al., 2017; Johnson & Nguyen, 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) or more network parameters (Kaplan et al., 2020; Rosenfeld et al., 2020). But such studies have typically required large-scale experiments that are outside the computational budgets of many research groups, and their purpose is extrapolation rather than validating design choices.

We find that a generalized power law function provides the best learning curve fit, while a model linear in n^-0.5, where n is the number of training samples (or "training size"), provides a good local approximation. We abstract the curve into two key parameters: e_N, the error at n = N, and β_N, a measure of data-reliance, revealing how much a classifier's error will change if the training set size changes. Learning curves provide valuable insights that cannot be obtained by single-point comparisons of performance. Our aim is to promote the use of learning curves as part of a standard learning system evaluation.
Our key contributions:
• Investigate how to best model, estimate, characterize, and display learning curves for use in classifier analysis
• Use learning curves to analyze impact on error and data-reliance due to network architecture, depth, width, fine-tuning, data augmentation, and pretraining

Table 1 shows validated and rejected popular beliefs that single-point comparisons often overlook. In the following sections, we investigate how to model learning curves (Sec. 2), how to estimate them (Sec. 3), and what they can tell us about the impact of design decisions (Sec. 4), with discussion of limitations and future work in Sec. 5.

| Popular belief | Your guess | Our guess | Our result | Result figures |
| Pre-training on similar domains nearly always helps compared to training from scratch. | | T | T | 9a, 9b, 3 |
| Pre-training, even on similar domains, introduces bias that would harm performance with a large enough training set. | | T | ? | 3 |
| Self-/un-supervised training performs better than supervised pre-training for small datasets. | | F | F | 3 |
| Fine-tuning the entire network (vs. just the classification layer) is only helpful if the training set is large. | | T | F | 9a, 9b |
| Increasing network depth, when fine-tuning, harms performance for small training sets, due to an overly complex model. | | T | F | 4a, 4b |
| Increasing network depth, when fine-tuning, is more helpful for larger training sets than smaller ones. | | T | F | 4a, 4b |
| Increasing network depth, if the backbone is frozen, is more helpful for smaller training sets than larger ones. | | T | F | 4a, 4b |
| Increasing depth or width improves more than ensembles of smaller networks with the same number of parameters. | | T | T | 4e, 4a |
| Data augmentation is roughly equivalent to using an m-times larger training set for some m. | | T | T | 5 |

Table 1: Deep learning quiz! We encourage our readers to judge each claim as T (true) or F (false), and then compare to our guesses and results. In the results column, "T" means the experiments are consistent with the belief, "F" inconsistent, and "?" hard to say.

2. MODELING LEARNING CURVES

2.1. BIAS-VARIANCE TRADE-OFF AND GENERALIZATION THEORY

The bias-variance trade-off is an intuitive and theoretical way to think about generalization. The "bias" is error due to the inability of the classifier to encode the optimal decision function, and the "variance" is error due to variation in predictions caused by limited availability of training samples for parameter estimation. This is called a trade-off because a classifier with more parameters tends to have less bias but higher variance. Geman et al. (1992) decompose mean squared regression error into bias and variance and explore the implications for neural networks, concluding that "identifying the right preconditions is the substantial problem in neural modeling". This conclusion foreshadows the importance of pretraining, though Geman et al. thought the preconditions must be built in rather than learned. Domingos (2000) extends the analysis to classification. Theoretically, the mean squared error (MSE) can be modeled as e²_test(n) = bias² + noise² + var(n), where "noise" is irreducible error due to a non-unique mapping from inputs to labels, and the variance can be modeled as var(n) = σ²/n for n training samples.

The ηn^-0.5 term appears throughout machine learning generalization theory. For example, the bounds based on hypothesis VC-dimension (Vapnik & Chervonenkis, 1971) and Rademacher complexity (Gnecco & Sanguineti, 2008) are both O(cn^-0.5), where c depends on the complexity of the classification model. More recent work also follows this form. We give some examples of bounds in Table 2 without describing all of the parameters, because the point is that, for all approaches, the test error bounds vary with training size n as a function of n^-0.5. Note that in Table 2, variable notation is consistent only within each line, except n.

2.2. EMPIRICAL STUDIES

Some recent empirical studies (e.g., Sun et al. (2017)) claim a log-linear relationship between error and training size, but this holds only when asymptotic error is zero. Hestness et al. (2017) model error as e_test(n) = α + ηn^γ but often find γ much smaller in magnitude than -0.5 and suggest that poor fits indicate a need for better hyperparameter tuning. This raises an interesting point: sample efficiency depends both on the classification model and on the efficacy of the optimization algorithm and its parameters. Johnson & Nguyen (2017) also find a better fit with this extended power law model than by restricting γ = -0.5 or α = 0. We find that, by selecting the learning rate through validation on one training size and using the Ranger optimizer (Wright, 2019), we can achieve a good approximate fit with γ = -0.5 and the best fit with -0.7 < γ < -0.3.

| Work | Key Variables | Bound |
| Neyshabur et al. (2018) | network depth d | O(n^-0.5 · B d √(h ln(dh)) · Π_{i=1}^d ||W_i||_2 · √(Σ_{i=1}^d ||W_i||²_F / ||W_i||²_2) / γ) |
| Bartlett et al. (2017) | spectral complexity R_w | O(||X||_2 R_w log(max_i h_i) / (γ·n) + n^-0.5) |
| Arora et al. (2018) | compressibility to q parameters with r discrete values | O(n^-0.5 √(q log r)) |
| Bousquet & Elisseeff (2002) | stability, with margin γ | O(n^-0.5 / γ) |

Table 2: Examples of generalization error bounds.

In the language domain, learning curves are used in a fascinating study by Kaplan et al. (2020). For natural language transformers, they show that a power law relationship between logistic loss, model size, compute time, and dataset size is maintained if (and only if) each is increased in tandem. We draw some similar conclusions to theirs, such as that increasing model size tends to improve performance especially for small training sets (which surprised us). However, the studies are largely complementary, as we study convolutional nets in computer vision, classification error (instead of logistic loss), and a broader range of design choices, such as effects of depth, width, data augmentation, pretraining source, architecture, and dataset. Also related, Rosenfeld et al. (2020) model error jointly as a function of training size and model size. A key difference in our work is that we focus on how to best draw insights about design choices from learning curves, rather than on extrapolation. As such, we propose methods to estimate learning curves and their variance from a relatively small number of trained models.

2.3. PROPOSED CHARACTERIZATION OF LEARNING CURVES FOR EVALUATION

A classifier's performance can be characterized in terms of its error and data-reliance, i.e., how quickly the error changes with training size n. With e(n) = α + ηn^γ, we find that γ = -0.5 provides a good local approximation but that fitting γ significantly improves leave-one-size-out RMS error and extrapolation accuracy, as we detail in Sec. 4. However, α, η, and γ cannot be meaningfully compared across curves: the parameters have high covariance under small data perturbations, and comparing η values is not meaningful unless γ is fixed, and vice versa. We propose to report error and sensitivity to training size in a way that can be derived from various learning curve models and is insensitive to data perturbations. The curve is characterized by the error e_N = α + ηN^γ and the data-reliance β_N at n = N, where we typically choose N as the full dataset size. Noting that most learning curves are locally well approximated by a model linear in n^-0.5, we compute data-reliance as β_N = N^-0.5 · (∂e/∂n^-0.5)|_{n=N} = -2ηγN^γ. When error is plotted against n^-0.5, β_N is the slope at N scaled by N^-0.5, where the scaling is chosen to make the practical implications of β_N more intuitive. This yields a simple predictor for error when changing training size by a factor of d: e(d·N) = e_N + (1/√d - 1)β_N. Thus, by this linearized estimate, the asymptotic error is e_N - β_N; a 4-fold increase in data (e.g., 400 → 1600) reduces error by 0.5β_N; and using only one quarter of the dataset (e.g., 400 → 100) increases error by β_N. For two models with similar e_N, the one with larger β_N would outperform with more data but underperform with less. Note that (e_N, β_N, γ) is a complete re-parameterization of the extended power law, with γ + 0.5 indicating the curvature in n^-0.5 scale.
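As a concrete sketch (helper names are our own, not from any released code), the re-parameterization and the linearized predictor can be computed directly from a fitted (α, η, γ):

```python
import numpy as np

def curve_summary(alpha, eta, gamma, N):
    """Summarize e(n) = alpha + eta * n**gamma at reference size N."""
    e_N = alpha + eta * N**gamma            # error at n = N
    beta_N = -2.0 * eta * gamma * N**gamma  # data-reliance: N^-0.5 * de/d(n^-0.5) at n = N
    return e_N, beta_N

def predict_scaled(e_N, beta_N, d):
    """Linearized error prediction when the training size is scaled by factor d."""
    return e_N + (1.0 / np.sqrt(d) - 1.0) * beta_N
```

For d = 4 this predicts a drop of 0.5·β_N, and as d → ∞ it approaches the linearized asymptote e_N - β_N, matching the rules of thumb above.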

3. ESTIMATING LEARNING CURVES

We now describe the method for estimating the learning curve from error measurements, with confidence bounds on the estimate. Let e_ij denote the random variable corresponding to test error when the model is trained on the j-th fold of n_i samples (either per class or in total). We assume {e_ij} for j = 1..F_i are i.i.d. according to N(μ_i, σ_i²). We want to estimate learning curve parameters α (asymptotic error), η, and γ, such that e_ij = α + ηn_i^γ + ε_ij, where ε_ij ~ N(0, σ_i²) and μ_i = E[e_ij]. Sections 3.1 and 3.2 describe how to estimate the mean and variance of α and η for a given γ, and Sec. 3.3 describes our approach for estimating γ.

3.1. WEIGHTED LEAST SQUARES FORMULATION

We estimate learning curve parameters {α, η} by optimizing a weighted least squares objective: G(γ) = min_{α,η} Σ_{i=1}^S Σ_{j=1}^{F_i} w_ij (e_ij - α - ηn_i^γ)², where w_ij = 1/(F_i σ_i²). F_i is the number of models trained with data size n_i and is used to normalize the weight so that the total weight of observations from each training size does not depend on F_i. The factor of σ_i² accounts for the variance of ε_ij. Assuming constant σ_i² and removing the F_i factor would yield unweighted least squares. The variance of the estimate of σ_i² from F_i samples is 2σ_i⁴/F_i, which can lead to over- or under-weighting data for particular i if F_i is small. Recall that each sample e_ij requires training an entire model, so F_i is always small in our experiments. We expect the variance to have the form σ_i² = σ_0² + σ̃²/n_i, where σ_0² is the variance due to random initialization and optimization and σ̃²/n_i is the variance due to randomness in selecting n_i samples. Indeed, by averaging over the variance estimates for many different network models on the CIFAR-100 (Krizhevsky, 2012) dataset, we find a good fit with σ_0² = 0.2. This enables us to estimate a single σ̃² parameter from all samples e in a given learning curve as a least squares fit, and also upper-bounds w_ij ≤ 5 even if two models happen to have the same error. This attention to w_ij may seem fussy, but without such care, we find that the fitted curve sometimes fails to account sufficiently for all the data.
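The variance model fit can be sketched as follows (a hypothetical helper of our own; the paper does not publish code): fixing σ_0², fitting σ̃² is a one-dimensional least squares problem with a closed form.

```python
import numpy as np

def fit_variance_model(n_sizes, emp_vars, sigma0_sq=0.2):
    """Fit s = sigma_tilde^2 in  var_i ≈ sigma0_sq + s / n_i  by least squares,
    then return the modeled sigma_i^2 at each training size."""
    x = 1.0 / np.asarray(n_sizes, dtype=float)
    y = np.asarray(emp_vars, dtype=float) - sigma0_sq
    s = float(x @ y) / float(x @ x)   # closed-form slope for a line through the origin
    return sigma0_sq + s * x
```

Pooling the fit of σ̃² across all sizes avoids the high variance of estimating each σ_i² from only F_i models.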

3.2. SOLVING FOR LEARNING CURVE MEAN AND VARIANCE

Concatenating errors across dataset sizes (indexed by i) and folds yields an error vector e of dimension D = Σ_{i=1}^S F_i. For each d ∈ {1, ..., D}, e[d] is an observation of error at dataset size n_{i_d}, distributed as N(μ_{i_d}, σ²_{i_d}), with i_d mapping d to the corresponding i. The weighted least squares problem can be formulated as solving the linear system W^{1/2} e = W^{1/2} A θ, where W ∈ R^{D×D} is a diagonal matrix of weights with W[d,d] = w_d, A ∈ R^{D×2} is a matrix with rows A[d,:] = [1, n_{i_d}^γ], and θ = [α, η]^T are the parameters of the learning curve, treating γ as fixed for now. The estimator for the learning curve is θ̂ = (W^{1/2}A)⁺ W^{1/2} e = M e, where M ∈ R^{2×D} and ⁺ denotes the pseudo-inverse. We compute a mean curve using θ̄ = E[θ̂] = M E[e] = M μ, where μ ∈ R^D with μ[d] = μ̂_{i_d} computed by the empirical estimate μ̂_i = Σ_{j=1}^{F_i} e_ij / F_i. The covariance of the estimator is Σ_θ̂ = M Σ_e M^T, where Σ_θ̂ ∈ R^{2×2} and Σ_e ∈ R^{D×D} is the covariance of e, with Σ_e[d_1, d_2] = σ²_{i_{d_1}} if i_{d_1} = i_{d_2} and 0 otherwise. We compute our empirical estimate of σ_i² as described in Sec. 3.1. Since the curve is given by e(n) = [1, n^γ] θ, the mean curve can be computed as ē(n) = [1, n^γ] θ̄ = ᾱ + η̄n^γ, where ᾱ and η̄ are the empirical estimates of α and η. The 95% bounds at any n can be computed as ē(n) ± 1.96 σ̂(n), with σ̂²(n) = [1, n^γ] Σ_θ̂ [1, n^γ]^T. These confidence bounds reflect the variance in error measurements, assuming the parameterization is capable of fitting the true mean.
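The estimator and confidence band above can be sketched in a few lines of numpy (a minimal illustration with our own function names; Σ_e is built exactly as defined above, with observations at the same size sharing σ_i²):

```python
import numpy as np

def fit_weighted_curve(n_obs, e_obs, sizes, var_by_size, folds_by_size, gamma):
    """Weighted least squares fit of e(n) = alpha + eta * n**gamma.

    n_obs, e_obs: one entry per trained model; sizes: sorted unique training
    sizes; var_by_size, folds_by_size: sigma_i^2 and F_i for each unique size.
    Returns theta = [alpha, eta] and its 2x2 covariance."""
    n_obs, e_obs = np.asarray(n_obs, float), np.asarray(e_obs, float)
    var_by_size = np.asarray(var_by_size, float)
    folds_by_size = np.asarray(folds_by_size, float)
    idx = np.searchsorted(sizes, n_obs)                 # observation -> size index i_d
    w = 1.0 / (folds_by_size[idx] * var_by_size[idx])   # w_ij = 1/(F_i * sigma_i^2)
    A = np.stack([np.ones_like(n_obs), n_obs**gamma], axis=1)
    M = np.linalg.pinv(np.sqrt(w)[:, None] * A) * np.sqrt(w)  # theta_hat = M @ e
    theta = M @ e_obs
    same = idx[:, None] == idx[None, :]   # Sigma_e[d1,d2] = sigma_i^2 iff i_d1 == i_d2
    Sigma_e = np.where(same, var_by_size[idx][:, None], 0.0)
    return theta, M @ Sigma_e @ M.T

def confidence_band(theta, Sigma_theta, n, gamma):
    """Mean curve value and 95% bounds at training size n."""
    a = np.array([1.0, n**gamma])
    mean, sd = a @ theta, np.sqrt(a @ Sigma_theta @ a)
    return mean - 1.96 * sd, mean, mean + 1.96 * sd
```

With observations that lie exactly on a curve, the fit recovers (α, η) exactly, and the band width reflects only the assumed measurement variance.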

3.3. ESTIMATING γ

We search for the γ that minimizes the weighted least squares objective with an L1 prior that slightly encourages values close to -0.5. Specifically, we solve min_{γ∈(-1,0)} G(γ) + λ|γ + 0.5| by searching over γ ∈ {-0.99, ..., -0.01}, with λ = 5 in our experiments.
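The grid search can be sketched as follows (hypothetical helper names; G(γ) re-solves the weighted fit at each candidate γ):

```python
import numpy as np

def weighted_sse(gamma, n_obs, e_obs, w):
    """G(gamma): weighted SSE with the best (alpha, eta) for this gamma."""
    n_obs, e_obs, w = (np.asarray(v, float) for v in (n_obs, e_obs, w))
    A = np.stack([np.ones_like(n_obs), n_obs**gamma], axis=1)
    theta, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * e_obs, rcond=None)
    return float(w @ (e_obs - A @ theta)**2)

def search_gamma(n_obs, e_obs, w, lam=5.0):
    """Grid search for gamma with an L1 prior pulling toward -0.5."""
    grid = np.arange(-99, 0) / 100.0    # {-0.99, ..., -0.01}
    objs = [weighted_sse(g, n_obs, e_obs, w) + lam * abs(g + 0.5) for g in grid]
    return float(grid[int(np.argmin(objs))])
```

When the data follow γ = -0.5 exactly, both the fit term and the prior vanish there, so the search returns -0.5.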

4. EXPERIMENTS

We describe our implementation details in Sec. 4.1, apply learning curves to gain insights about error and data-reliance in Sec. 4.2, and validate our choice of learning curve parameterization and fitting weights used in the least squares objective in Sec. 4.3.

4.1. IMPLEMENTATION DETAILS

We use Pytorch-Lightning (Falcon, 2019) for our implementation, with various architectures, weight initializations, data augmentations, and linear or fine-tuning optimization.

Training: We train models with images of size 32×32 for CIFAR (Krizhevsky, 2012) and 224×224 for Places365 (Zhou et al., 2017), with a batch size of 64 (except for Wide-ResNet101 and Wide-ResNeXt101, where we use a batch size of 32 and perform one optimizer step every two batches). For each experiment setting, we conduct a learning rate search on a subset of the training data, choose the learning rate with the highest validation accuracy, and use it for all other subsets. We determine each fold's training schedule on a 2:1 mini-train/mini-val split of the train set. Each time the mini-val error stops decreasing for some number of epochs (the "patience"), we revert to the best epoch and decrease the learning rate to 10%, and we perform this twice. Then we use this mini-train learning rate schedule and ending epoch to train on the whole fold. The patience is ∝ 1/√n: 5 at n = 400 samples/class for CIFAR100/Places365 and 15 at the largest training size for the other, smaller datasets. We use a weight decay of 0.0001. We use the Ranger optimizer (Wright, 2019), which combines Rectified Adam (Liu et al., 2019) with Lookahead. For CIFAR100 and Places365, we train models at six training sizes (samples per class), with {16, 8, 4, 3, 3, 1} models each. For other datasets (Fig. 6), we use {20%, 40%, 80%} of the full data and train {4, 2, 1} models each. We hold out 20% of the data from the original training set for testing (a validation set could also be used, if available) to discourage metafitting on the test set. For example, we hold out 100 samples per class from the original CIFAR100 training set and perform hyperparameter selection and training on the remaining 400 samples.
Pretraining: When pretraining is used, we initialize models with pretrained weights learned through supervised training on ImageNet or Places365, or MOCO self-supervised training on ImageNet (He et al., 2020). Otherwise, weights are randomly initialized with Kaiming initialization.

Data Augmentation: For CIFAR, we pad by 4 pixels and use a random 32×32 crop (testing without augmentation); for Places365, we use a random-sized crop (Szegedy et al., 2015) to 224×224 and random flipping (center 224×224 crop at test time). For the remaining datasets, we follow the preprocessing in Zhai et al. (2020) that produced the best results when training from scratch.

Linear vs. Fine-tuning: For "linear", we train only the final classification layer, with the other weights frozen to their initialized values. All weights are trained when "fine-tuning".

We plot the fitted learning curves and confidence bounds, with observed test errors as black points. The legend displays γ, error e_N, and data-reliance β_N with N = 400. The x-axis is in n^-0.5 scale (n in parentheses), but the best-fitting γ is used in all cases. In all plots, n denotes the number of samples per class, except in Fig. 6, where n is the total number of samples. A vertical bar indicates n = 1600, which we consider the limit of accurate extrapolation from curves fit to n ≤ 400 samples. All points are used for fitting, except in Fig. 9b, where n = 1600 is held out to test extrapolation.

Network architecture: Advances in CNN architectures have reduced the number of parameters while also reducing error over the range of training sizes. On CIFAR100, AlexNet has 61M parameters; VGG-16, 138M; ResNet-50, 26M; ResNeXt-50, 25M; and ResNet-101, 45M. Fig. 1 shows that each major advance through ResNet reduces both data-reliance and e_400, while ResNeXt appears to slightly reduce e_400 without changing data-reliance.

Pretraining and fine-tuning: In Fig.
2, we see that, for linear classifiers, pretraining leads to a huge improvement in e_400, though with a moderate increase in data-reliance. When fine-tuning, pretraining greatly reduces the data-reliance β_400 and also reduces e_400. Pretraining clearly improves performance at smaller training sizes. However, we cannot draw conclusions about bias because extrapolated asymptotic error is not reliable, and the full story is complicated. On an object detection task, He et al. (2019) find that, with long learning schedules, randomly initialized networks approach the performance of pretrained networks (for the CNN backbone), even with finite data. Experiments by Zoph et al. (2020), also on object detection, show that pretraining can sometimes harm performance when strong data augmentation is used. Kornblith et al. (2019) show that fine-tuned pretrained models outperform randomly initialized models on many datasets, but the gap is often small and narrows as data size grows. All agree that pretraining is at least important for providing a warm start that greatly reduces training time, but whether it introduces bias (i.e., asymptotic error) likely depends on the tasks, domains, and optimization settings.

Pretraining data sources: In Fig. 3, we compare pretraining strategies: random initialization, supervised pretraining on ImageNet or Places365, and self-supervised pretraining on ImageNet (MOCO by He et al. (2020)). All initializations have similar extrapolated error at n = 1600, but different data-reliance. Self-supervised MOCO leads to lower e_400 and β_400 compared to Places365 pretraining. Supervised pretraining on ImageNet has the lowest e_400 and β_400. We suspect that the γ = -0.67 and higher extrapolated asymptotic error may be due to measurement noise and suboptimal hyperparameter selection.

Network depth, width, and ensembles: The classical view is that smaller datasets need simpler models to avoid overfitting. In Figs.
4a and 4b, we show that deeper networks not only have better potential at larger data sizes, but their data-reliance also does not increase (the curves are nearly parallel, and data-reliance even drops a little for fine-tuning), making deeper networks perfectly suitable for smaller datasets. For linear classifiers (Fig. 4b), the deeper networks provide better features, leading to a consistent drop in e_400. The small jump in data-reliance between Resnet-34 and Resnet-50 may be due to the increased last-layer input size, from 512 to 2048 nodes. When increasing width, the fine-tuned networks (Fig. 4c) have reduced e_400 without much change to data-reliance. With linear classifiers (Fig. 4d), increasing the width leads to little change or even an increase in e_400, with a slight decrease in data-reliance. Rosenfeld et al. (2020) show that error can be modeled as a function of either training size, model size, or both. Modeling both jointly can provide additional capabilities, such as selecting model size based on data size, but requires many more experiments to fit the curve. An alternative to a deeper or wider network is forming ensembles. In Figure 4e, we find that while an ensemble of six models improves over a single model, it has higher e_400 and data-reliance than ResNet-101 (44.5M), Wide-ResNet-50 (68.9M), and Wide-ResNet-101 (126.9M). Three ResNet-50s (each 25.6M) underperform Wide-ResNet-50 on e_400 but outperform it for small amounts of data due to lower data-reliance.

Data Augmentation: One may expect data augmentation to act as a regularizer with reduced effect for large training sizes, or even a negative effect due to introduced bias. However, Fig. 5 shows that data augmentation on Places365 reduces error at all training sizes, with little or no change to data-reliance when fine-tuning. e(n) with augmentation roughly equals e(1.8n) without it, supporting the view that augmentation acts as a multiplier on the value of an example.
For the linear classifier, data augmentation has little apparent effect, due to low data-reliance, but the results are still consistent with this multiplier.

Additional datasets: In Fig. 6, we compare fine-tuned vs. linear Resnet-18 on several additional datasets (with preprocessing following Zhai et al. (2020)). For these plots only, n is the total number of samples. The γ values are estimated from data, but the prior has more effect here due to fewer error measurements. We see that fine-tuning consistently outperforms linear, though the difference is most dramatic for Sun397. Pretraining provides large benefits across datasets.

4.3. EVALUATION OF LEARNING CURVES MODEL AND FITTING

We validate our learning curve model using leave-one-size-out prediction error: for example, predicting empirical mean performance at 400 samples per class based on observed errors from models trained on 25, 50, 100, and 200 samples. We consider various choices of weighting schemes (the w's in the objective of Sec. 3.1) and of which parameters to estimate in a general form of the learning curve, e(n) = α + ηn^γ + δn^{2γ}. Note that setting δ = 0 yields the learning curve model described in Sec. 3.

Weighting Schemes: In the Fig. 7 table, we compare three weighting schemes across 16 classifiers: w_ij = 1 is unweighted; w_ij = 1/σ_i² weights by the estimated size-dependent variance; and w_ij = 1/(F_i σ_i²) makes the total weight for a given dataset size invariant to the number of folds. On average, our proposed weighting performs best, with high significance compared to unweighted. The p-value is from a paired t-test of the difference of means, calculated across all dataset sizes.

Model Choice: We consider other parameterizations that are special cases of e(n) = α + ηn^γ + δn^{2γ}. The table in Fig. 7 shows that the parameterization used in our experiments outperforms the others, in most cases with high significance, and achieves a very good fit, with R² of 0.998.

Model Stability: We test stability and sample requirements by repeatedly fitting curves to four resampled data points for one model (Resnet-18, no pretraining, fine-tuned, tested on Places365). Based on estimates of the mean and standard deviation, one point each at n = {50, 100, 200, 400} is sampled and used to fit a curve, repeated 100 times. Parentheses in the legend show the standard deviation of the estimates of e_N, β_N, and γ. Our preferred model extrapolates best to n = 1600 and n = 25 while retaining stable estimates of e_N and β_N, but the predicted asymptotic error α varies widely.
Appendix C shows that similar estimates of e_N and β_N are obtained by fixing γ = -0.5 and fitting only α and η on the three largest sizes (typically n = {100, 200, 400}), indicating that a lightweight approach of training a few models can yield similar conclusions.
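Leave-one-size-out validation can be sketched as follows (a simplified, unweighted version with fixed γ; function names are our own):

```python
import numpy as np

def fit_alpha_eta(ns, errs, gamma=-0.5):
    """Unweighted least-squares fit of e(n) = alpha + eta * n**gamma."""
    A = np.stack([np.ones_like(ns), ns**gamma], axis=1)
    theta, *_ = np.linalg.lstsq(A, errs, rcond=None)
    return theta  # [alpha, eta]

def leave_one_size_out_rms(ns, errs, gamma=-0.5):
    """RMS of prediction residuals when each training size is held out in turn."""
    ns, errs = np.asarray(ns, float), np.asarray(errs, float)
    resid = []
    for i in range(len(ns)):
        keep = np.arange(len(ns)) != i
        alpha, eta = fit_alpha_eta(ns[keep], errs[keep], gamma)
        resid.append(errs[i] - (alpha + eta * ns[i]**gamma))
    return float(np.sqrt(np.mean(np.square(resid))))
```

A low held-out RMS indicates the parameterization predicts unseen training sizes well, which is the criterion used to compare models in Fig. 7.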

5. LIMITATIONS AND FUTURE WORK

Limitations: Our work in this paper is limited to classification error, and our model does not account for small-sample effects where chance performance is a major factor. Although our proposed e_N and β_N are stable under perturbations and different learning curve parameterizations, the asymptotic error α and exponent γ are unstable, and our confidence interval does not account for variance in γ. Unstable α means that little can be concluded about asymptotic performance, though e_N - β_N can stand in as a measure of large-data performance. Unstable γ may mean that conclusions are subject to the hyperparameter selection and optimization method.

Future work: Do hyperparameters such as learning rate, schedule, and weight decay determine γ, or does something else? It appears that γ < -0.5 is accompanied by high α and/or η. Should γ = -0.5 for a well-trained system? Answering these questions could lead to improved training and evaluation methodologies. It would also be interesting to investigate learning curve models for small training sizes, other losses and prediction types, more design parameters and interactions, and the impact of imbalance in the class distribution. Appendix A offers extended discussion. Appendix B provides a guide to fitting, displaying, and using learning curves. Appendix C contains a table of learning curves for all of our experiments and compares e_N and β_N produced by two learning curve models.

A EXTENDED DISCUSSION

By characterizing classifiers in terms of error and data-reliance, we can provide a more complete understanding of model design and training size impact than single-point error. With that perspective, we discuss the limitations of our experiments and directions for future work.

• Cause and impact of γ: We speculate that γ is largely determined by hyperparameters and optimization rather than model design, so conclusions are conditioned on particular training parameters. This presents an opportunity to identify poor training regimes and improve them. Intuitively, one would expect more negative γ values to be better (i.e., γ = -1 preferable to γ = -0.5), since the curve is O(n^γ), but we find that high-magnitude γ tends to come with high asymptotic error, indicating that the efficiency comes at the cost of over-commitment to initial conditions. We speculate (but with some disagreement among authors) that γ = -0.5 is an indication of a well-trained curve and will generally outperform curves with higher or lower γ, given the same classification model. It would be interesting to examine the impact of hyperparameter selection and optimization method on γ.

• Small training sets: Error is bounded, and with small training sets, classifier performance may be modeled as transitioning from random guessing to informed prediction, as shown by Rosenfeld et al. (2020). We do not model performance at very small training sizes, partly to keep our model simple, partly because small-training performance can easily be measured empirically, and partly because performance at small training sizes is highly variable depending on the sample. However, studying performance at small training sizes could be interesting, particularly to determine whether design decisions have an impact at small sizes that is not apparent at larger ones.

• Losses and prediction types: We analyze multiclass classification error, but the same analysis could likely be extended to other losses and prediction types. For example, Kaplan et al. (2020) analyze learning curves of cross-entropy loss, which is unbounded, for language model transformers. Problems like object detection or grounding sometimes have relatively complex evaluation measures, such as average precision after accounting for localization and label accuracy, but test evaluation of the same losses used for training should still apply. Sun et al. (2017) show approximately log-linear behavior of mean intersection-over-union for semantic segmentation as a function of the number of training samples.

• More design parameters and interactions: The interaction between data scale, model scale, and performance is well explored by Kaplan et al. (2020) and Rosenfeld et al. (2020), but it could also be interesting to explore interactions, e.g., between the class of architecture (e.g., VGG, ResNet, EfficientNet (Tan & Le, 2019)) and other design parameters, to see how ideas such as skip connections, residual layers, and bottlenecks influence performance. More extensive evaluation of data augmentation, representation learning, optimization, and regularization methods would also be interesting.

• Unbalanced class distributions: In most of our experiments, we use an equal number of samples per class. Further experimentation is required to determine whether class imbalance impacts the form of the learning curve.

B USER'S GUIDE TO LEARNING CURVES

B.1 USES FOR LEARNING CURVES

• Comparison: When comparing two learners, measuring error and data-reliance provides a better understanding of the differences than evaluating single-point error. We compare curves with e_N and β_N, rather than directly using the curve parameters, because they are more stable under data perturbations and do not depend on the parameterization, instead corresponding to the error and rate of change around n = N. e_N - β_N can be used as a measure of large-sample performance.

• Performance extrapolation: A 10x increase in training data can require a large investment, sometimes millions of dollars. Learning curves can predict how much performance will improve with the additional data, to judge whether the investment is worthwhile.

• Model selection: When much training data is available, architecture, hyperparameters, and losses can be designed and selected using a small subset of the data to minimize the extrapolated error at the full training set size. Higher-parameter models, such as in Kaplan et al. (2020) and Rosenfeld et al. (2020), may be more useful as a mechanism to simultaneously select scale parameters and extrapolate performance, though fitting those models is much more computationally expensive due to the requirement of sampling error/loss at multiple model scales and data sizes.

• Hyperparameter validation: A poorly fitting learning curve (or one with γ far from -0.5) is an indication of a poor choice of hyperparameters, as pointed out by Hestness et al. (2017).

B.2 ESTIMATING AND DISPLAYING LEARNING CURVES

Use validation set: We recommend computing learning curves on a validation set rather than a test set, following the best practice of performing a single evaluation on the test set for the final version of the algorithm. All of our experiments are on a validation set, which is carved from the official training set if necessary.

Generate at least four data points: In most of our experiments on CIFAR100, we train 31 models: 1 on 400 images per class, 2 on 200, 4 on 100, 8 on 50, and 16 on 25. Each trained model provides one data point, the average validation error. In each case, the training data is partitioned so that the image sets within the same size are non-overlapping. Training multiple models at each size enables estimating the standard deviation for performing weighted least squares and producing confidence bounds. However, our experiments indicate that learning curves are highly stable, so a minimal experiment of training four models on the full, half, quarter, and eighth-size training sets may be sufficient as part of a standard evaluation (see Fig. 8). It may be necessary to train more models if attempting to distinguish fine differences.

Set hyperparameters: The learning rate and learning schedule are key parameters to be set. We have not experimented with changes to weight decay, momentum, or other hyperparameters.

Fit learning curves: If more than one data point is available for the same training size, the standard deviation can be estimated. As described in Sec. 3, we recommend fitting a model of sigma_i^2 = sigma_0^2 + sigma^2 / n, where sigma_0^2 is the variance due to randomness in initialization and optimization. The fitting is not highly sensitive to this parameter, so we recommend setting sigma_0^2 = 0.01 and fitting sigma to the observations, since estimating both from the experiments used to generate a single learning curve introduces high variance and instability.
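As a concrete illustration, the weighted fit described above can be sketched as follows. This is a minimal sketch with hypothetical error measurements, not the authors' code; in particular, defining the data-reliance beta_N as the slope of the fitted curve with respect to sqrt(N/n) at n = N is our reading of the parameterization, stated as an assumption.

```python
# Minimal sketch (not the authors' code) of fitting e(n) = alpha + eta * n**gamma
# by weighted least squares, with per-size noise sigma_i^2 = sigma_0^2 + sigma^2 / n.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, alpha, eta, gamma):
    """Extended power-law learning curve: error as a function of training size n."""
    return alpha + eta * n**gamma

# Hypothetical mean validation errors (%) at each training size per class.
n = np.array([25.0, 50.0, 100.0, 200.0, 400.0])
err = np.array([55.0, 44.7, 37.5, 32.4, 28.8])

sigma0_sq = 0.01                    # fixed, as recommended above
sigma_sq = 4.0                      # would normally be fit to repeated runs
noise_std = np.sqrt(sigma0_sq + sigma_sq / n)

popt, _ = curve_fit(power_law, n, err, p0=[10.0, 150.0, -0.5], sigma=noise_std)
alpha, eta, gamma = popt

# Abstract the fitted curve into error and data-reliance at N = 400.
N = 400.0
e_N = power_law(N, alpha, eta, gamma)
beta_N = -2.0 * eta * gamma * N**gamma  # slope w.r.t. sqrt(N/n) at n = N (our reading)
```

Passing `sigma` to `curve_fit` downweights the noisier small-n points, which is the weighted-least-squares step; with only one model per size, `noise_std` would come entirely from the fitted sigma^2 / n term.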



Bousquet & Elisseeff (2002) give a generalization bound of O(n^-0.5 / gamma) based on an analysis of stability with margin gamma.



[Figure 1 legend, fitted curves e(n) = alpha + eta * n^gamma: AlexNet (e_400 = 26.89; beta_400 = 12.4; gamma = -0.32); VGG-16(bn) (e_400 = 19.1; beta_400 = 9.79; gamma = -0.38); ResNet-50 (e_400 = 18.33; beta_400 = 4.32; gamma = -0.67); ResNeXt-50(32x4d) (e_400 = 16.45; beta_400 = 4.91; gamma = -0.65); ResNet-101 (e_400 = 14.97; beta_400 = 5.03; gamma = -0.62)]

Figure 1: Architecture (w/ finetuning)

[Figure 2 legends, fitted curves e(n) = alpha + eta * n^gamma. (a): No Pretr; Linear (e_400 = 79.29; beta_400 = 1.32; gamma = -0.84); No Pretr; Finetune (e_400 = 27.91; beta_400 = 18.23; gamma = -0.41); Pretr; Linear (e_400 = 32.42; beta_400 = 5.61; gamma = -0.35); Pretr; Finetune (e_400 = 18.86; beta_400 = 7.28; gamma = -0.57). (b) Transfer: Imagenet to Places365: No Pretr; Linear (e_400 = 92.79; beta_400 = 0.96; gamma = -0.5); No Pretr; Finetune (e_400 = 57.89; beta_400 = 12.86; gamma = -0.26); Pretr; Linear (e_400 = 59.91; beta_400 = 4.16; gamma = -0.38); Pretr; Finetune (e_400 = 54.0; beta_400 = 7.33; gamma = -0.28)]

Figure 2: Pretraining and fine-tuning with ResNet-18.

Figure 4: Depth, width, and ensembles on CIFAR100.

Figure 5: Data augmentation on Places365.

Figure 6: Additional datasets

Figure 7: Learning curve model and validation. See text for explanation.

Figure 8: Stability under sparse measurements: Sampled learning curves for Places365 fine-tuned without pretraining are shown for four different learning curve parameterizations. In each case, means and standard deviations (shown by error bars) are estimated for n = 50, n = 100, n = 200, and n = 400, using all the data points shown as white circles. Then, 100 times, we sample one point per size from a Gaussian distribution and fit a learning curve to the four points. In parentheses, the legend shows the standard deviations of e_N, beta_N, and gamma. Note that the parameterization {alpha, eta, gamma} extrapolates best to lower and higher data sizes while still producing stable estimates of e_N and beta_N. The asymptotic error, however, varies widely.
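The resampling procedure in the caption above can be sketched for the simplest case, the model linear in n^-0.5. The means and standard deviations below are made up for illustration; this is not the paper's experimental code.

```python
# Sketch of the Fig. 8 stability check: sample one point per training size from
# a Gaussian, refit the curve, and report the spread of the fitted parameters.
# The linear-in-n**-0.5 parameterization is used here for simplicity.
import numpy as np

n = np.array([50.0, 100.0, 200.0, 400.0])
mean_err = np.array([70.0, 65.5, 61.5, 58.0])  # hypothetical per-size means (%)
std_err = np.array([0.8, 0.6, 0.4, 0.3])       # hypothetical per-size stds

rng = np.random.default_rng(0)
x = n ** -0.5
e_400_samples = []
for _ in range(100):
    sample = rng.normal(mean_err, std_err)       # one sampled point per size
    slope, intercept = np.polyfit(x, sample, 1)  # weighting omitted for brevity
    e_400_samples.append(intercept + slope * 400 ** -0.5)

e_400_spread = np.std(e_400_samples)  # the kind of std shown in the Fig. 8 legend
```

A small spread of e_400 under this perturbation is what the caption means by a stable parameterization; the same loop applied to alpha alone (the asymptotic error) would show much larger variation.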


Rosenfeld et al. (2020) model error as a function of both training size and number of model parameters, using a five-parameter function that accounts for training size, model parameter count, and chance performance.

Number of Training Examples: To compute learning curves for CIFAR and Places365, we vary the number of training examples per class, partition the train set, and train one model per partition. For CIFAR100 (Krizhevsky, 2012), we use {25, 50, 100, 200, 400} training examples per class, and the number of models trained for each is, respectively, {16, 8, 4, 2, 1}. Similar to Hestness et al. (2017), we find that training sizes smaller than 25 samples per class are strongly influenced by bounded error and deviate from our model. For the Places365 dataset, we use {25, 50, 100, 200, 400, 1600} training examples per class.
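The partitioning scheme above can be sketched as follows. The helper below is hypothetical (not from the paper's code release); it splits a fixed per-class budget into disjoint subsets at each size, so that subsets within a size are non-overlapping while the same pool is reused across sizes.

```python
# Sketch: split a fixed per-class budget (here 400 indices) into disjoint
# subsets at each training size: 1 model sees 400, 2 see 200 each, and so on.
import numpy as np

def partition_by_size(indices, sizes=(400, 200, 100, 50, 25), seed=0):
    """Return {size: list of disjoint index arrays covering `indices`}."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    return {s: [shuffled[i * s:(i + 1) * s] for i in range(len(shuffled) // s)]
            for s in sizes}

splits = partition_by_size(np.arange(400))
total_models = sum(len(v) for v in splits.values())  # 1 + 2 + 4 + 8 + 16 = 31
```

One model is then trained per subset, and the per-size mean error and standard deviation feed directly into the weighted curve fit of Sec. 3.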

6. CONCLUSION

We investigate learning curve models for analyzing classifier design decisions. We find that an extended power law provides the best fit across many different architectures, datasets, and other design parameters. We propose to characterize error and data-reliance with e_N and beta_N, which are stable under data perturbations and can be derived from different learning curve models. Our experiments lead to several interesting observations about the impacts of pretraining, fine-tuning, data augmentation, depth, width, and ensembles. We anticipate that learning curves can further inform training methodology, continual learning, and representation learning, among other problems, and hope to see learning curves become part of a standard classification evaluation.

A EXTENDED DISCUSSION

Evaluation methodology is the foundation of research, impacting how we choose problems and rank solutions. Large train and test sets now serve as the fuel and crucible used to refine machine learning methods. The current evaluation standard of using fixed i.i.d. train/test sets has supported many classification model improvements, but as machine learning broadens to continual learning, representation learning, long-tail learning, and so on, we need evaluation methods that better reflect the uncontrollable, unpredictable, and ever-changing world. Characterizing performance in terms of error and data-reliance is one step in that direction.

Display learning curves or parameters: As in this paper, learning curves can be plotted with the x-axis as n^-0.5 and the y-axis as error. We choose this rather than a log-linear plot because it reveals the predicted asymptotic error and yields a straight line when gamma = -0.5. Since space is often at a premium, the learning curve parameters can be displayed instead, as illustrated in Table 3. However, if the classifier parameters are functions of n, then gamma may deviate from -0.5. For example, Tsybakov (2008) shows that a kernel density estimator (KDE) with fixed bandwidth h has MSE bounded by O(1/(nh)), but when the bandwidth is set as a function of n to minimize MSE, the bound becomes O(n^(-2*beta/(2*beta+1))), where beta is the kernel order. In our experiments, all aspects of our model are fixed across training sizes when estimating one learning curve, except the learning schedule, but it should be noted that error bounds, and likely the learning curve parameters, depend on both the classifier form and which parameters vary with n.
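The recommended display can be sketched with matplotlib as below; the error values are hypothetical, and this is an illustration of the plotting convention rather than the paper's plotting code.

```python
# Sketch: plot error vs. n**-0.5, so a gamma = -0.5 curve appears as a straight
# line and the y-intercept at x = 0 shows the predicted asymptotic error.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for scripting
import matplotlib.pyplot as plt

n = np.array([25.0, 50.0, 100.0, 200.0, 400.0])
err = np.array([55.0, 44.7, 37.5, 32.4, 28.8])  # hypothetical errors (%)

fig, ax = plt.subplots()
ax.plot(n ** -0.5, err, "o-")
ax.set_xlabel(r"$n^{-0.5}$")
ax.set_ylabel("validation error (%)")
ax.set_xlim(0, None)  # x = 0 corresponds to infinite training data
fig.savefig("learning_curve.png")
```

Reversing the x-axis (so that larger n appears to the right) is a common variant of this display, but the key choice is the n^-0.5 scale itself.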

D.3 OTHER PLANNED IMPROVEMENTS

• Experiments to include WRN-28-10 (or a similarly effective Wide ResNet model) on CIFAR-100, to show that the learning curve methodology applies and the experimental findings hold for high-performing models.
• Discussion to clarify that the experiments serve to exemplify the use of learning curves and to make interesting observations, but that more extensive study of each design parameter is warranted. Also, discuss any other concerns/limitations raised by reviewers.

