SYNG4ME: MODEL EVALUATION USING SYNTHETIC TEST DATA

Abstract

Model evaluation is a crucial step in ensuring reliable machine learning systems. Currently, predictive models are evaluated on held-out test data, quantifying aggregate model performance. Limitations of available test data make it challenging to evaluate model performance on small subgroups or when the environment changes. Synthetic test data provides a unique opportunity to address this challenge: instead of evaluating predictive models on real data, we propose to use synthetic data. This brings two advantages. First, supplementing and increasing the amount of evaluation data can lower the variance of model performance estimates compared to evaluation on the original test data. This is especially true for local performance evaluation in low-density regions, e.g. minority or intersectional groups. Second, generative models can be conditioned so as to induce a shift in the synthetic data distribution, allowing us to evaluate how supervised models could perform in different target settings. In this work, we propose SYNG4ME: an automated suite of synthetic data generators for model evaluation. By generating smart synthetic datasets, data practitioners gain a new tool for exploring how supervised models may perform on subgroups of the data, and how robust methods are to distributional shifts. We show experimentally that SYNG4ME achieves more accurate performance estimates compared to using the test data alone.

1. INTRODUCTION

For machine learning (ML) to be truly useful in safety-critical and high-impact areas such as medicine or finance, it is crucial that models are rigorously audited and evaluated. Failure to perform rigorous testing could result in models at best failing unpredictably and at worst leading to silent failures. There are many examples, such as models that perform unexpectedly on certain subgroups (Oakden-Rayner et al., 2020; Suresh & Guttag, 2019; Cabrera et al., 2019a;b) or models not generalizing across domains due to distributional mismatches (Pianykh et al., 2020; Quinonero-Candela et al., 2008; Koh et al., 2021). Understanding such model limitations is vital to build trust in ML systems, as well as to guide user understanding of the conditions in which the model can be safely and reliably used. Many mature industries involve standardized processes to evaluate performance under a variety of testing and/or operating conditions (Gebru et al., 2021). For instance, automobiles make use of wind tunnels and crash tests to assess specific components, whilst electronic component datasheets outline conditions where reliable operation is guaranteed.

Unfortunately, current approaches to characterize the performance of supervised ML models do not have the same level of detail and rigor. Instead, the prevailing approach in ML is to evaluate only using average prediction performance on a hold-out test set. Average performance on a test set from the same underlying distribution has two clear disadvantages. (1) No insight into granular performance: by treating all samples equally, we may miss performance differences for smaller subgroups. Even if we decided to evaluate performance on a more granular level, we may not have enough real test data to get an accurate evaluation for small subgroups. (2) Ignores distributional shifts: the world is constantly evolving, hence the setting of interest may not have the same data distribution as the test set.
This typically leads to overestimated real-world performance (Patel et al., 2008; Recht et al., 2019). The community has attempted to address both of these issues. (1) Granular performance: Ribeiro et al. (2020) and Röttger et al. (2021) propose model behavioral testing methods, which manually craft tests of specific model use-cases. (2) Distributional shifts: shifted test sets are typically curated by hand for each anticipated shift. Both approaches are largely manual, labor-intensive processes. This raises the following question: Can we define a model evaluation framework where the evaluation is both low-effort and customizable to new datasets, such that practitioners can evaluate their trained ML model(s) under a variety of conditions, customized for their specific tasks and datasets of choice?

With the above in mind, our goal is to build a model evaluation framework with the following desired properties (P1-P2), motivated by practical, real-world ML failure cases (cited below):

(P1) Reliable granular evaluation: we want to accurately evaluate predictive model performance on a granular level, even for regions with few test samples. For example, evaluating performance on (i) subgroups (Oakden-Rayner et al., 2020; Suresh et al., 2018; Goel et al., 2020; Cabrera et al., 2019a;b), (ii) samples of interest (Deo, 2015; Savage, 2012; Nezhad et al., 2017) and (iii) low-density regions (Saria & Subbaswamy, 2019; D'Amour et al., 2020; Cohen et al., 2021).

(P2) Sensitivity to distributional shifts: we want to accurately evaluate model performance and sensitivity when the deployment distribution is different, as this often leads to model degradation (Pianykh et al., 2020; Quinonero-Candela et al., 2008; Koh et al., 2021).

Contributions. This paper makes the following contributions.

1. Practical model evaluation framework: We propose Synthetic Data Generation for Model Evaluation (SYNG4ME, pronounced: "sing for me"): an evaluation framework for characterizing ML model performance for both (P1) reliable granular evaluation and (P2) sensitivity to distributional shifts (Sec. 4). At its core, SYNG4ME uses generative models to create synthetic test sets for model evaluation. For example, as illustrated in Fig. 1, we can generate larger test sets for small subgroups or test sets with shifts in distribution. To the best of our knowledge, this is the first work that focuses on synthetic data for evaluating supervised models.

2. Accurate granular performance evaluation: We find that the use of synthetic test data provides a more accurate estimate of the true performance on small subgroups compared to using the small (real) test set alone (Sec. 5.1). This is especially true when evaluating performance on minority and intersectional subgroups, for which we introduce the intersectional model performance matrix (Fig. 4).

3. Quantifying model sensitivity to distributional shifts: We show how synthetic test data is able to quantify predictive model performance changes as a result of common distributional shifts (defined in Sec. 3), both in terms of model sensitivity across the operating range (Sec. 5.2.1) and with only high-level knowledge of the shift (Sec. 5.2.2).

Figure 1: SYNG4ME is a framework for evaluating model performance using synthetic data generators. It has three phases: training the generative model, generating synthetic data, and model evaluation. Firstly, SYNG4ME enables (P1) reliable granular evaluation when there is (i) limited real test data in small subgroups, by (ii) generating synthetic data conditional on subgroup information X_c, thereby (iii) permitting more reliable model evaluation even on small subgroups. Secondly, SYNG4ME enables assessment of (P2) sensitivity to distributional shifts when (i) the real test data does not reflect shifts, by (ii) generating synthetic data conditional on marginal shift information of features X_c, thereby (iii) quantifying model sensitivity to distributional shift. Required inputs are denoted in yellow.
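The (P1) workflow in the caption (train a conditional generator, sample extra synthetic test data for a rare subgroup, evaluate the predictive model on it) can be sketched on toy data. Everything below is illustrative rather than the paper's implementation: the noisy-threshold "model under audit", the 5% minority subgroup, and the `CondGaussianGen` class (a per-subgroup Gaussian fit standing in for the deep generative models SYNG4ME would actually train) are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: subgroup B (g=1) is a small minority (about 5% of samples).
def sample_real(n):
    g = (rng.random(n) < 0.05).astype(int)               # subgroup indicator
    x = rng.normal(loc=np.where(g == 1, 2.0, 0.0), scale=1.0)
    y = (x + rng.normal(0, 0.5, n) > 1.0).astype(int)    # noisy threshold label
    return x, y, g

x_tr, y_tr, g_tr = sample_real(5000)

# A fixed "trained" predictive model under audit (a simple threshold rule).
predict = lambda x: (x > 1.0).astype(int)

# Stand-in conditional generative model: one Gaussian per (subgroup, label) cell.
class CondGaussianGen:
    def fit(self, x, y, g):
        self.params = {}
        for gv in (0, 1):
            mask = g == gv
            self.params[gv] = {
                "p_y": y[mask].mean(),                   # label prior in subgroup
                "stats": {yv: (x[mask & (y == yv)].mean(),
                               x[mask & (y == yv)].std())
                          for yv in (0, 1)},
            }
        return self

    def sample(self, n, g):                              # condition on subgroup g
        p = self.params[g]
        y = (rng.random(n) < p["p_y"]).astype(int)
        (m0, s0), (m1, s1) = p["stats"][0], p["stats"][1]
        x = rng.normal(np.where(y == 1, m1, m0), np.where(y == 1, s1, s0))
        return x, y

gen = CondGaussianGen().fit(x_tr, y_tr, g_tr)

# Evaluate subgroup-B accuracy three ways.
x_te, y_te, g_te = sample_real(500)                      # small real test set
real_b = g_te == 1                                       # only a handful of B samples
acc_real = (predict(x_te[real_b]) == y_te[real_b]).mean()

x_syn, y_syn = gen.sample(50_000, g=1)                   # large synthetic B test set
acc_syn = (predict(x_syn) == y_syn).mean()

x_or, y_or, g_or = sample_real(2_000_000)                # oracle estimate of the truth
acc_true = (predict(x_or[g_or == 1]) == y_or[g_or == 1]).mean()

print(f"subgroup B: n_real={real_b.sum()}, real-est={acc_real:.3f}, "
      f"synthetic-est={acc_syn:.3f}, true={acc_true:.3f}")
```

With only a few dozen real subgroup-B test points, the real-data estimate has high variance across draws, whereas the 50,000-sample synthetic estimate is stable; its remaining error is the generator's approximation bias, which is the trade-off Sec. 5.1 examines.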


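The (P2) sensitivity analysis can be sketched in the same style: condition the synthetic sampler on a marginal shift of a feature and sweep the shift magnitude to trace the model's operating range. The sampler below is a hypothetical stand-in for a trained generator conditioned on marginal shift information; the decision thresholds, noise level, and shift grid are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed labelling process P(y|x), held constant under the shift (covariate
# shift); here emulated with a noisy threshold to keep the sketch runnable.
label = lambda x: (x + rng.normal(0, 0.5, x.size) > 1.0).astype(int)

# The trained predictive model under audit.
predict = lambda x: (x > 0.8).astype(int)

# Conditional synthetic sampler: shift the marginal mean of x by `delta`
# while keeping P(y|x) fixed.
def synthetic_test_set(delta, n=100_000):
    x = rng.normal(loc=delta, scale=1.0, size=n)
    return x, label(x)

# Sweep the shift magnitude to quantify sensitivity across the operating range.
accs = {}
for delta in (0.0, 0.5, 1.0, 1.5, 2.0):
    x, y = synthetic_test_set(delta)
    accs[delta] = (predict(x) == y).mean()
    print(f"mean shift {delta:+.1f} -> estimated accuracy {accs[delta]:.3f}")
```

The resulting accuracy curve is lowest where the shifted marginal concentrates mass near the model's decision boundary, which is exactly the kind of degradation profile a single unshifted test set cannot reveal.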