SYNG4ME: MODEL EVALUATION USING SYNTHETIC TEST DATA

Abstract

Model evaluation is a crucial step in ensuring reliable machine learning systems. Currently, predictive models are evaluated on held-out test data, quantifying aggregate model performance. Limitations of the available test data make it challenging to evaluate model performance on small subgroups or when the environment changes. Synthetic test data provides a unique opportunity to address this challenge: instead of evaluating predictive models on real data alone, we propose to use synthetic data. This brings two advantages. First, supplementing and increasing the amount of evaluation data can lower the variance of model performance estimates compared to evaluation on the original test data alone. This is especially true for local performance evaluation in low-density regions, e.g. minority or intersectional groups. Second, generative models can be conditioned so as to induce a shift in the synthetic data distribution, allowing us to evaluate how supervised models would perform in different target settings. In this work, we propose SYNG4ME: an automated suite of synthetic data generators for model evaluation. By generating targeted synthetic data sets, data practitioners gain a new tool for exploring how supervised models may perform on subgroups of the data, and how robust methods are to distributional shifts. We show experimentally that SYNG4ME achieves more accurate performance estimates than using the test data alone.
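The second advantage above, evaluating a fixed model under an induced distribution shift, can be illustrated with a minimal sketch. Here a Gaussian whose mean we vary stands in for a learned conditional generator; the threshold "model", the labelling rule, and the `generate` helper are illustrative assumptions, not the actual SYNG4ME pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# A classifier "trained" on the source distribution: predicts 1 iff x > 0.5.
# The true labelling rule is y = 1 iff x > 0, so the model's threshold is
# slightly miscalibrated; how much this hurts depends on how much input
# density falls in the error interval (0, 0.5].
def model(x):
    return (x > 0.5).astype(int)

def label(x):
    return (x > 0).astype(int)

# Stand-in for a learned conditional generator: a Gaussian whose mean we can
# set. Conditioning on `mean` induces a covariate shift in the synthetic data.
def generate(mean, n=50_000):
    x = rng.normal(loc=mean, size=n)
    return x, label(x)

# Evaluate the *fixed* model as the synthetic test distribution shifts from
# the source setting (mean 2.0) towards the decision boundary (mean 0.0).
accs = []
for mean in (2.0, 1.0, 0.0):
    x, y = generate(mean)
    accs.append(float((model(x) == y).mean()))
    print(f"shift mean={mean:+.1f}  accuracy={accs[-1]:.3f}")
```

Because the synthetic generator supplies as many labelled samples as we like at each shift, the resulting accuracy-versus-shift curve is smooth and low-variance, something a fixed real test set cannot provide.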

1. INTRODUCTION

For machine learning (ML) to be truly useful in safety-critical and high-impact areas such as medicine or finance, it is crucial that models are rigorously audited and evaluated. Failure to perform rigorous testing could result in models at best failing unpredictably and at worst leading to silent failures. There are many examples, such as models that perform unexpectedly on certain subgroups (Oakden-Rayner et al., 2020; Suresh & Guttag, 2019; Cabrera et al., 2019a;b) or models that do not generalize across domains due to distributional mismatches (Pianykh et al., 2020; Quinonero-Candela et al., 2008; Koh et al., 2021). Understanding such model limitations is vital to instill trust in ML systems, as well as to guide user understanding of the conditions under which the model can be safely and reliably used.

Many mature industries have standardized processes for evaluating performance under a variety of testing and/or operating conditions (Gebru et al., 2021). For instance, automobiles are subjected to wind tunnels and crash tests to assess specific components, whilst electronic component datasheets outline the conditions under which reliable operation is guaranteed. Unfortunately, current approaches to characterizing the performance of supervised ML models lack this level of detail and rigor. Instead, the prevailing approach in ML is to evaluate only average prediction performance on a hold-out test set.

Average performance on a test set from the same underlying distribution has two clear disadvantages. (1) No insight into granular performance: by treating all samples equally, we may miss performance differences for smaller subgroups. Even if we decided to evaluate performance at a more granular level, we may not have enough real test data to obtain an accurate evaluation for small subgroups. (2) Ignores distributional shifts: the world is constantly evolving, hence the setting of interest may not have the same data distribution as the test set.
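Disadvantage (1) is easy to reproduce numerically. The following sketch (a toy setup with an assumed threshold "model" and group-specific label-noise rates, not data from this paper) draws repeated test sets containing a large majority group and a 20-sample minority group, and shows that the minority-group accuracy estimate fluctuates heavily across test-set draws even though its true value is fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": predicts y = 1 whenever x > 0.
def model(x):
    return (x > 0).astype(int)

# Each group follows the same model-aligned rule plus group-specific label
# noise, so the model's true accuracy in a group is exactly 1 - flip_p.
def sample_group(n, flip_p):
    x = rng.normal(size=n)
    y = (x > 0).astype(int)
    flip = rng.random(n) < flip_p
    y[flip] = 1 - y[flip]
    return x, y

# Draw 200 independent test sets: 1000 majority samples but only 20 minority
# samples each time, and record the minority-group accuracy estimate.
minority_acc = []
for _ in range(200):
    x_maj, y_maj = sample_group(1000, flip_p=0.10)  # true accuracy 0.90
    x_min, y_min = sample_group(20, flip_p=0.30)    # true accuracy 0.70
    minority_acc.append(float((model(x_min) == y_min).mean()))

# The estimate is centred on the true value 0.70 but swings widely from one
# test-set draw to the next.
print(f"mean={np.mean(minority_acc):.2f}  std={np.std(minority_acc):.3f}")
```

With only 20 minority samples, the standard deviation of the subgroup accuracy estimate is roughly sqrt(0.7 * 0.3 / 20), about 0.10, i.e. individual test draws routinely report anywhere from 0.60 to 0.80 for a model whose true subgroup accuracy is 0.70.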
This typically leads to overestimated real-world performance (Patel et al., 2008; Recht et al., 2019). The community has attempted to address both of these issues. (1) Granular performance: Ribeiro et al. (2020) and Röttger et al. (2021) propose using model behavioral testing for model evaluation, which manually crafts tests of specific model use-cases; and (2) Distributional shifts:

