PREDICTING THE IMPACT OF DATASET COMPOSITION ON MODEL PERFORMANCE

Anonymous authors
Paper under double-blind review

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as for designing more effective data collection policies. We show that there is a simple, accurate way to predict the loss incurred by a model based on data size and composition. Our work expands recent observations of log-linear generalization error and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational-function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves nearly exact (r^2 > .93) predictions of model performance under substantial extrapolation on two standard supervised learning tasks, and remains accurate (r^2 > .83) on more challenging machine translation and question answering tasks, where baselines achieve worse-than-random performance.

1. INTRODUCTION

The success of large-scale machine learning systems depends critically on the quantity and quality of data used during training, and we cannot expect these systems to succeed if there is not enough training data or if that data does not cover all the phenomena contained in the test distribution (Ben-David et al., 2010). Knowing this, the designer of a machine learning system might create multiple sources of data, with each one targeting a different feature or domain that the model ought to do well on (Crammer et al., 2007; Wang et al., 2019a). This data-driven design strategy provides powerful tools to improve and evaluate model behavior, but it also poses an additional challenge: what is the right way to combine these various data sources? What is the optimal data collection policy for a given budget?

Our goal is to answer these questions by quantifying the relationship between data sources and model performance: how well will our model do if we train it on n samples drawn according to a data mixture (q_1, ..., q_K) over our K data sources? A precise model for predicting model performance allows us both to identify the optimal data collection policy and to quantify cost-performance tradeoffs.

The starting point of our work is the recent observation across speech, vision, and text (Hestness et al., 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) that the empirical performance of a model is remarkably predictable, and follows the log-linear formula

log(error) ≈ -α log(n) + C.    (1)

In this work, we extend this observation to the multi-data-source setting and discover the surprising fact that the slope of the log-linear relationship (α) does not vary with data composition: the data composition affects only the intercept (C). The simple dependence of log-error on data size allows us to reduce the problem of estimating model error to a learning problem.
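To make the single-source version of Eq. (1) concrete, the sketch below fits the log-linear scaling law by ordinary least squares in log-log space and then extrapolates to a larger dataset size. This is an illustrative fit, not the paper's code; the (n, error) pairs are hypothetical values chosen so that error halves with every 4x increase in data (i.e., α = 0.5).

```python
import numpy as np

def fit_scaling_law(ns, errors):
    """Fit log(error) = -alpha * log(n) + C; return (alpha, C)."""
    log_n = np.log(np.asarray(ns, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    # Least-squares line in log-log space: log_err = slope * log_n + intercept
    slope, intercept = np.polyfit(log_n, log_err, deg=1)
    return -slope, intercept  # alpha = -slope, C = intercept

def predict_error(n, alpha, C):
    """Extrapolate model error to a new dataset size n via Eq. (1)."""
    return np.exp(-alpha * np.log(n) + C)

# Hypothetical training runs (error halves for every 4x data -> alpha = 0.5).
ns = [1_000, 4_000, 16_000, 64_000]
errors = [0.32, 0.16, 0.08, 0.04]
alpha, C = fit_scaling_law(ns, errors)
print(alpha)                             # ~0.5
print(predict_error(256_000, alpha, C))  # ~0.02, a 4x extrapolation in n
```

In the multi-source setting described below, the intercept C would become a fitted function C(q) of the mixture weights, while α stays shared across mixtures.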
Our approach is straightforward: we hypothesize that model error follows V(n, q) := exp(-α log(n) + log(C(q))) for a simple parametric functional form C(q), and fit this to observed (n, q, error) triples that we obtain by subsampling the dataset and re-training a model. We show that there is a natural and simple choice of C(q) as a rational function, which we derive from optimal experimental design for linear regression, M-estimation, and nonparametric

