ALT-MAS: A DATA-EFFICIENT FRAMEWORK FOR ACTIVE TESTING OF MACHINE LEARNING ALGORITHMS

Abstract

Machine learning models are used extensively in many important areas, but there is no guarantee that a model will always perform well or as its developers intended. Understanding the correctness of a model is crucial to prevent potential failures that may have a significant detrimental impact in critical application areas. In this paper, we propose a novel framework to efficiently test a machine learning model using only a small amount of labelled test data. The core idea is to efficiently estimate the metrics of interest for a model-under-test using a Bayesian neural network. We develop a novel methodology to incorporate information from the model-under-test into the Bayesian neural network training process. We also devise an entropy-based sampling strategy to select data points so that the proposed framework can accurately estimate the metrics of interest. Finally, we conduct an extensive set of experiments testing various machine learning models on different types of metrics. Our experiments on multiple datasets show that, given a testing budget, our method estimates the metrics significantly more accurately than existing state-of-the-art approaches.

1. INTRODUCTION

Today, supervised machine learning models are employed across sectors to assist humans in making important decisions. Understanding the correctness of a model is thus crucial to avoid potential (and severe) failures. In practice, however, it is not always possible to accurately evaluate a model's correctness using the held-out training data from the development process (Sawade et al., 2010). Consider a hospital that buys an automated medical image classification system. The supplier will provide a performance assessment, but this evaluation may not hold in the new setting, as the supplier's and the hospital's data distributions may differ. Similarly, an enterprise that develops a business prediction system might find that performance changes significantly over time as the input distribution shifts away from the original training data. In these cases, the model's performance needs to be re-evaluated, since the assessments provided by the supplier or obtained during development can be inaccurate. Accurately re-evaluating the model requires new labelled data points from the deployment area, but labelling is expensive, as one would usually need a large number of test instances. The open question is thus how to test the performance of a machine learning model (the model-under-test) with parsimonious use of labelled data from the deployment area. This work addresses this challenge while treating the model-under-test as a black-box, since in common practice one only has access to the model's outputs.

One previous approach estimates a risk score, a function of the model-under-test's output and the ground truth (akin to a metric), using limited labelled data (Sawade et al., 2010). However, this approach has only been shown to be tractable for some specific risk functions (e.g. accuracy).
Another approach (Gopakumar et al., 2018) searches for the worst-case model performance using limited labelled data; however, we posit that using the worst case to assess the goodness of a model-under-test is overkill, because the worst case is often just an outlier. More recently, Schelter et al. (2020) validate a model without labelled data by generating a synthetic dataset representative of the deployment data. The restrictive assumption is that domain experts must provide a set of data generators, which is usually infeasible in practice.
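To give a concrete flavour of label-efficient, uncertainty-driven test selection (this is an illustrative sketch of the general idea, not the exact ALT-MAS procedure; the function names and the toy probabilities are ours), one can rank unlabelled test points by the entropy of the model-under-test's predictive distribution and spend the labelling budget on the most uncertain points:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy (in nats) of each row of a (n, k) array of class probabilities."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def entropy_based_test_selection(probs, budget):
    """Greedily pick the `budget` most-uncertain points to send for labelling.

    probs: (n, k) predicted class probabilities from the model-under-test.
    Returns indices of the selected points, most uncertain first.
    """
    ent = predictive_entropy(probs)
    return np.argsort(-ent)[:budget]

# Toy example: 5 points, 3 classes; point 2 is maximally uncertain (uniform).
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [1 / 3, 1 / 3, 1 / 3],
    [0.80, 0.10, 0.10],
    [0.50, 0.40, 0.10],
])
selected = entropy_based_test_selection(probs, budget=2)  # -> indices [2, 4]
```

A metric such as accuracy would then be estimated from the labels acquired for the selected points; unlike this plain greedy rule, the framework proposed in this paper additionally uses a Bayesian neural network informed by the model-under-test to drive the estimation.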

