PREDICTING THE IMPACT OF DATASET COMPOSITION ON MODEL PERFORMANCE

Anonymous authors
Paper under double-blind review

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as for designing more effective data collection policies. We show that there is a simple, accurate way to predict the loss incurred by a model based on data size and composition. Our work expands recent observations of log-linear generalization error and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves nearly exact ($r^2 > 0.93$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 > 0.83$) on more challenging machine translation and question answering tasks where baselines achieve worse-than-random performance.

1. INTRODUCTION

The success of large-scale machine learning systems depends critically on the quantity and quality of data used during training, and we cannot expect these systems to succeed if there is not enough training data or if that data does not cover all the phenomena contained in the test distribution (Ben-David et al., 2010). Knowing this, the designer of a machine learning system might create multiple sources of data, with each one targeting a different feature or domain that the model ought to do well on (Crammer et al., 2007; Wang et al., 2019a). This data-driven design strategy provides powerful tools to improve and evaluate model behavior, but also poses an additional challenge: what is the right way to combine these various data sources? What is the optimal data collection policy for a given budget?

Our goal is to answer these questions by quantifying the relationship between data sources and model performance: how well will our model do if we train it on $n$ samples using a data mixture $(q_1, \ldots, q_K)$ over our $K$ data sources? A precise model for predicting model performance will allow us to both identify the optimal data collection policy and quantify cost-performance tradeoffs.

The starting point of our work is the recent observation across speech, vision, and text (Hestness et al., 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) that the empirical performance of a model is remarkably predictable, and follows the log-linear formula
$$\log(\text{error}) \approx -\alpha \log(n) + C. \tag{1}$$
In this work, we expand this observation to the multi-data-source setting and discover the surprising fact that the slope of the log-linear relationship ($\alpha$) does not vary with data composition, and that the data composition only affects the intercept ($C$). The simple dependence of log-error on data size allows us to reduce the problem of estimating model error to a learning problem.
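The scaling law in Eq. (1) can be fitted by ordinary least squares on $(\log n, \log \text{error})$ pairs. A minimal sketch; the training set sizes and error values below are made-up illustrative numbers, not measurements from the paper:

```python
# Sketch: fitting the log-linear scaling law log(error) ~ -alpha*log(n) + C
# by ordinary least squares in log-log space. The sizes and errors below
# are made-up illustrative numbers, not results from the paper.
import numpy as np

n = np.array([250, 500, 1000, 2000, 4000])        # training set sizes
err = np.array([0.42, 0.33, 0.26, 0.205, 0.162])  # observed test errors

# Regress log(err) on log(n): the slope is -alpha, the intercept is C.
X = np.stack([np.log(n), np.ones_like(n, dtype=float)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(err), rcond=None)
alpha, C = -coef[0], coef[1]

def predicted_error(n_new):
    """Extrapolate the fitted law to an unseen training set size."""
    return np.exp(-alpha * np.log(n_new) + C)
```

Once $\alpha$ and $C$ are estimated from small training runs, `predicted_error` extrapolates to larger, unseen dataset sizes.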
Our approach is straightforward: we hypothesize that model error follows $V(n, q) := \exp(-\alpha \log(n) + \log C(q))$ for a simple parametric functional form $C(q)$, and fit this to observed $(n, q, \text{error})$ triples that we obtain by subsampling the dataset and re-training a model. We show that there is a natural and simple choice of $C(q)$ as a rational function, which we derive from optimal experimental design for linear regression, M-estimation, and nonparametric settings.

Empirically, the resulting predictions are extremely accurate and hold under substantial extrapolation. On the Amazon review prediction dataset (Mansour et al., 2009), we can learn to predict model performance nearly perfectly ($r^2 = 0.96$) from a small dataset of 1200 examples across 3 sources and extrapolate to predict the model error on datasets of up to 4000 examples. We show this high accuracy continues to hold on a real-world task-oriented dialogue system ($r^2 = 0.93$), a multi-domain machine translation system ($r^2 = 0.83$), and boolean question answering with weak supervision ($r^2 = 0.86$). In each of these cases, our proposed approach substantially outperforms the best baseline, with the baselines performing worse than random in both the machine translation and question answering tasks.
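The fitting recipe can be sketched as follows. For concreteness we use a deliberately simplified, hypothetical rational form $C(q) = 1 / (w \cdot q)$ with nonnegative weights $w$, a scalar analogue of the V-optimal design formula; the paper's actual parameterization may differ. Both $\alpha$ and $w$ are fitted jointly by least squares in log space:

```python
# Sketch: fitting log(error) ~ -alpha*log(n) + log(C(q)) from observed
# (n, q, error) triples. We assume a simplified, hypothetical rational
# form C(q) = 1 / (w . q) with w >= 0 (a scalar analogue of V-optimal
# design), not necessarily the exact parameterization in the paper.
import numpy as np
from scipy.optimize import minimize

def fit_performance_model(ns, qs, errors, K):
    """ns: (m,) training sizes; qs: (m, K) mixture weights; errors: (m,) test errors."""
    def objective(params):
        alpha, log_w = params[0], params[1:]
        w = np.exp(log_w)  # parameterize in log space to keep weights positive
        pred_log_err = -alpha * np.log(ns) - np.log(qs @ w)
        return np.mean((pred_log_err - np.log(errors)) ** 2)
    res = minimize(objective, x0=np.zeros(1 + K), method="L-BFGS-B")
    return res.x[0], np.exp(res.x[1:])  # fitted alpha, fitted weights w

def predict_error(alpha, w, n, q):
    """Predicted test error for a dataset of size n with composition q."""
    return np.exp(-alpha * np.log(n)) / (q @ w)
```

Each observed triple comes from subsampling the available data at size $n$ and composition $q$ and re-training the model; the fitted $(\alpha, w)$ then predict error for any unseen $(n, q)$.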

Related work

Quantifying the effect of data composition on model performance is closely related to the classical ideas of optimal experimental design, as well as more recent machine learning methods such as active learning and data valuation. Our work draws inspiration from classical $V$-optimal experimental design (John & Draper, 1975) as a way to understand how model performance will change with the data collection policy. However, our approach differs substantially beyond this: instead of making strong linearity assumptions and identifying closed-form formulas for model performance, we treat identifying the impact of data sources on errors as itself a prediction problem, which allows us to quantify these effects for neural networks and non-separable objectives.

Active learning provides methods for incrementally selecting new points to rapidly reduce a loss (Hanneke, 2007). These approaches only consider the problem of optimal data collection and do not seek to predict model performance under all data collection strategies (including suboptimal ones), which is critical when making cost-performance tradeoffs across data sources. The model performance predictions produced in our work complement existing work on active learning by providing accurate forecasts of model performance under different data collection strategies.

Finally, data valuation methods such as the Shapley value attempt to estimate the impact of a data source on model performance (Ghorbani & Zou, 2019; Jia et al., 2019; Ghorbani et al., 2020; Yoon et al., 2019). These approaches are natural when pricing data sources as part of a market mechanism (Ohrimenko et al., 2019; Agarwal et al., 2019) due to the axiomatic properties of the Shapley value. Our approach differs in that we seek simply to estimate the performance of a model rather than to assign a single price to examples from a data source. This difference means that axioms such as additivity that are critical for the Shapley value are not relevant for our goal. We show that, for the purpose of predicting errors, a rational function (rather than a linear cost) follows naturally from optimal experimental design. Our experiments also show that our rational function approximation provides better model performance predictions than a linear, additive model.

2. PROBLEM STATEMENT AND EMPIRICAL OBSERVATIONS

Our goal is to predict the performance of a model as a function of the number of training samples $n$ as well as the dataset composition $q$, where $q_k$ represents the fraction of the training data drawn from data source $k$. We will now define this goal more formally in terms of the training data distribution, model fitting, and test loss.

The training data consists of an $n$-sample training set $p_{n,q}$ that is created by sampling from the mixture $p := \sum_{k \in [K]} q_k p_k$, where the $p_k$ are data-generating distributions for each of the $K$ data sources and the $q_k$ are mixture weights with $q_k \geq 0$ and $\sum_{k \in [K]} q_k = 1$. Using this dataset, we learn a prediction model $\hat\theta$ that incurs loss $\ell(\hat\theta; x, y)$ for a training example $(x, y)$. The fitted model is the empirical loss minimizer, which we define as
$$\hat\theta(p_{n,q}) := \arg\min_{\theta \in \Theta} \mathbb{E}_{p_{n,q}}[\ell(\theta; x, y)].$$
The performance of this classifier is evaluated on a test distribution which may differ from the training distribution by a covariate shift (i.e., $p(y \mid x) = p_{\text{test}}(y \mid x)$). We are interested in model performance as a function of the data size and composition (and not a fixed empirical distribution
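A small illustrative helper (ours, not from the paper) showing how the $n$-sample training set $p_{n,q}$ can be drawn from the mixture $\sum_k q_k p_k$: each example first picks a source index $k \sim q$, then samples from that source's pool.

```python
# Illustrative helper (ours, not from the paper): draw an n-sample
# training set from the mixture sum_k q_k * p_k by first choosing a
# source index k ~ q per example, then sampling from that source's pool.
import numpy as np

def sample_mixture(sources, n, q, seed=None):
    """sources: list of K arrays of examples; q: K mixture weights summing to 1."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(sources), size=n, p=q)  # source index for each example
    return [sources[k][rng.integers(len(sources[k]))] for k in ks]
```

In expectation, a fraction $q_k$ of the resulting $n$ examples comes from source $k$, matching the sampling process described above.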

