LEARNING CURVES FOR ANALYSIS OF DEEP NETWORKS

Abstract

A learning curve models a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. We also provide several interesting observations based on learning curves for a variety of image classification models.

1. INTRODUCTION

What gets measured gets optimized. We need better measures of learning ability to design better classifiers and to predict the payoff of collecting more data. Currently, classifiers are evaluated and compared by measuring performance on one or more datasets according to a fixed train/test split, and ablation studies help evaluate the impact of design decisions. However, one of the most important characteristics of a classifier, how it performs with varying numbers of training samples, is rarely measured or modeled.

In this paper, we refine the idea of learning curves that model error as a function of training set size. Learning curves were introduced nearly thirty years ago (e.g., by Cortes et al. (1993)) to accelerate model selection of deep networks. Recent works have demonstrated the predictability of performance improvements with more data (Hestness et al., 2017; Johnson & Nguyen, 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) or more network parameters (Kaplan et al., 2020; Rosenfeld et al., 2020). But such studies have typically required large-scale experiments that are outside the computational budgets of many research groups, and their purpose is extrapolation rather than validating design choices.

We find that a generalized power law provides the best learning curve fit, while a model linear in n^{-0.5}, where n is the number of training samples (the "training size"), provides a good local approximation. We abstract the curve into two key parameters: e_N, the error at n = N, and β_N, a measure of data-reliance that reveals how much a classifier's error will change if the training set size changes. Learning curves provide valuable insights that cannot be obtained by single-point comparisons of performance. Our aim is to promote the use of learning curves as part of a standard learning system evaluation.
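As a concrete sketch, the local model linear in n^{-0.5} can be fit with ordinary least squares and abstracted into the two parameters. The definition of beta_N below, taken as the size-dependent part of the error at n = N, is one hypothetical instantiation of data-reliance for illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def fit_local_curve(ns, errors):
    """Least-squares fit of the local model e(n) = a + b * n**-0.5."""
    ns = np.asarray(ns, dtype=float)
    X = np.stack([np.ones_like(ns), ns ** -0.5], axis=1)
    a, b = np.linalg.lstsq(X, np.asarray(errors, dtype=float), rcond=None)[0]
    return a, b

def abstract_params(a, b, N):
    """Abstract the fit into (e_N, beta_N): the error at n = N and the
    size-dependent part of that error (a stand-in for data-reliance;
    the paper's exact definition may differ)."""
    e_N = a + b * N ** -0.5
    beta_N = b * N ** -0.5
    return e_N, beta_N
```

With noiseless synthetic errors generated from a known curve, the fit recovers the curve exactly; with real measurements, a robust or weighted fit (as the paper proposes) would be preferable.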
Our key contributions:
• Investigate how to best model, estimate, characterize, and display learning curves for use in classifier analysis
• Use learning curves to analyze the impact on error and data-reliance of network architecture, depth, width, fine-tuning, data augmentation, and pretraining

Claim | Guess | Result | Fig.
Increasing depth or width improves more than ensembles of smaller networks with the same number of parameters. | T | T | 4e, 4a
Data augmentation is roughly equivalent to using an m-times larger training set for some m. | T | T | 5

Table 1: Deep learning quiz! We encourage our readers to judge each claim as T (true) or F (false), and then compare to our guesses and results. In the results column, "T" means the experiments are consistent with the belief, "F" means inconsistent, and "?" means hard to say.

2.1. BIAS-VARIANCE TRADE-OFF AND GENERALIZATION THEORY

The bias-variance trade-off is an intuitive and theoretical way to think about generalization. The "bias" is error due to the inability of the classifier to encode the optimal decision function, and the "variance" is error due to variation in predictions caused by the limited availability of training samples for parameter estimation. This is called a trade-off because a classifier with more parameters tends to have less bias but higher variance.

Geman et al. (1992) decompose mean squared regression error into bias and variance and explore the implications for neural networks, leading to the conclusion that "identifying the right preconditions is the substantial problem in neural modeling". This conclusion foreshadows the importance of pretraining, though Geman et al. thought the preconditions must be built in rather than learned. Domingos (2000) extends the analysis to classification. Theoretically, the mean squared error (MSE) can be modeled as e_test^2(n) = bias^2 + noise^2 + var(n), where "noise" is irreducible error due to a non-unique mapping from inputs to labels, and the variance can be modeled as var(n) = σ^2/n for n training samples.

A term of the form η·n^{-0.5} appears throughout machine learning generalization theory. For example, the bounds based on hypothesis VC-dimension (Vapnik & Chervonenkis, 1971) and Rademacher complexity (Gnecco & Sanguineti, 2008) are both O(c·n^{-0.5}), where c depends on the complexity of the classification model. More recent work also follows this form. We give some examples of bounds in Table 2 without describing all of their parameters, because the point is that, for all approaches, the test error bounds vary with training size n as a function of n^{-0.5}.
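The var(n) = σ^2/n behavior is easy to verify numerically for the simplest estimator, the sample mean of n Gaussian draws; this is an illustrative toy, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

def estimator_variance(n, reps=5000):
    """Empirical variance of the sample mean of n draws from N(0, sigma^2),
    estimated over many independently drawn 'training sets'."""
    means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    return means.var()

# The estimator's variance shrinks as sigma**2 / n, i.e., its
# standard deviation shrinks as n**-0.5.
for n in (10, 100, 1000):
    print(n, estimator_variance(n), sigma ** 2 / n)
```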



Additional claims from the quiz (Table 1):
• Pre-training, even on similar domains, introduces bias that would harm performance with a large enough training set.
• Fine-tuning the entire network (vs. only the classification layer) is only helpful if the training set is large.
• Increasing network depth, when fine-tuning, harms performance for small training sets, due to an overly complex model.
• Increasing network depth, if the backbone is frozen, is more helpful for smaller training sets than larger ones.

The learning curve measures test error e_test as a function of the number of training samples n for a given classification model and learning method. Previous empirical observations suggest a functional form e_test(n) = α + η·n^γ, with bias-variance trade-off and generalization theories typically indicating γ = -0.5. We summarize what bias-variance trade-off and generalization theories (Sec. 2.1) and empirical studies (Sec. 2.2) can tell us about learning curves, and describe our proposed abstraction in Sec. 2.3.
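Because α and η enter linearly once γ is fixed, the generalized power law can be fit by scanning candidate values of γ and solving the remaining linear problem in closed form. This is a minimal sketch using plain least squares; the robust estimation method the paper proposes would, for example, weight each measured error:

```python
import numpy as np

def fit_power_law(ns, errs, gammas=np.linspace(-1.0, -0.1, 91)):
    """Fit e(n) = alpha + eta * n**gamma by scanning gamma and solving
    alpha, eta by ordinary least squares for each candidate."""
    ns = np.asarray(ns, dtype=float)
    errs = np.asarray(errs, dtype=float)
    best = None
    for g in gammas:
        X = np.stack([np.ones_like(ns), ns ** g], axis=1)
        coef = np.linalg.lstsq(X, errs, rcond=None)[0]
        sse = float(((errs - X @ coef) ** 2).sum())  # sum of squared residuals
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], g)
    _, alpha, eta, gamma = best
    return alpha, eta, gamma
```

On noiseless synthetic errors drawn from a curve with γ = -0.5, the scan recovers all three parameters; on real error measurements, the SSE surface over γ is shallow, which is one reason a fixed γ = -0.5 local model is attractive.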

Generalization bound examples: each bound predicts generalization error that varies with training size n as n^{-0.5}. Note: variable notation is consistent only within each line, except n.

