ALT-MAS: A DATA-EFFICIENT FRAMEWORK FOR ACTIVE TESTING OF MACHINE LEARNING ALGORITHMS

Abstract

Machine learning models are used extensively in many important areas, but there is no guarantee that a model will always perform well or as its developers intended. Understanding the correctness of a model is crucial to prevent potential failures that may have a significant detrimental impact in critical application areas. In this paper, we propose a novel framework to efficiently test a machine learning model using only a small amount of labelled test data. The core idea is to efficiently estimate the metrics of interest for a model-under-test using a Bayesian neural network. We develop a novel methodology to incorporate information from the model-under-test into the Bayesian neural network training process. We also devise an entropy-based sampling strategy that selects data points so that the proposed framework gives accurate estimates of the metrics of interest. Finally, we conduct an extensive set of experiments testing various machine learning models on different types of metrics. Our experiments on multiple datasets show that, for a given testing budget, our method's metric estimates are significantly more accurate than those of existing state-of-the-art approaches.

1. INTRODUCTION

Today, supervised machine learning models are employed across sectors to assist humans in making important decisions. Understanding the correctness of a model is thus crucial to avoid potential (and severe) failures. In practice, however, it is not always possible to accurately evaluate a model's correctness using the held-out training data from the development process (Sawade et al., 2010). Consider a hospital that buys an automated medical image classification system. The supplier will provide a performance assessment, but this evaluation may not hold in the new setting, as the supplier's and the hospital's data distributions may differ. Similarly, an enterprise that develops a business prediction system might find that performance changes significantly over time as the input distribution shifts away from the original training data. In these cases, the model's performance needs to be re-evaluated, as the assessments provided by the supplier or the development process can be inaccurate. Accurate re-evaluation requires new labelled data points from the deployment area, but labelling is expensive, as one would usually need a large number of test instances. The open question is thus how to test the performance of a machine learning model (the model-under-test) with parsimonious use of labelled data from the deployment area.

This work addresses this challenge while treating the model-under-test as a black box, since in common practice one only has access to the model's outputs. One previous approach estimates a risk score, a function of the model-under-test's output and the ground truth (akin to a metric), using limited labelled data (Sawade et al., 2010). However, that approach has only been shown to be tractable for some specific risk functions (e.g. accuracy).
Another approach (Gopakumar et al., 2018) searches for the worst-case model performance using limited labelled data; however, we posit that using the worst case to assess the goodness of a model-under-test is overkill, because the worst case is often just an outlier. Recently, Schelter et al. (2020) proposed validating the model without labelled data by generating a synthetic dataset representative of the deployment data. The restrictive assumption is that domain experts must provide a set of data generators, a task usually infeasible in practice. We propose a scalable, data-efficient framework that can assess the performance of a black-box model-under-test on any metric (applicable to black-box models) without prior knowledge from users. Furthermore, our framework can estimate multiple metrics simultaneously. The motivation for evaluating one or multiple metrics comes from current practice, where users need to assess the model-under-test on one or several aspects that matter to them. For instance, for a classification system, a user might want to check only the overall accuracy, or to simultaneously check the overall accuracy, macro-precision (recall), and/or the accuracies of particular classes of interest. To achieve sample efficiency, we formulate our testing framework as an active learning (AL) problem (Cohn et al., 1996). First, a small subset of the test dataset is labelled, and a surrogate model is learned from this subset to predict the ground truth of the unlabelled data points in the test dataset. Second, an acquisition function is constructed to decide which data point in the test dataset should be chosen for labelling. The data point selected by the acquisition function is sent to an external oracle for labelling and is then added to the labelled set. This process repeats until the labelling budget is depleted. The metrics of interest are then estimated using the learned surrogate model.
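The iterative label-query-retrain loop described above can be sketched as follows. This is a minimal, generic sketch of an active testing loop, not the paper's implementation: `train_surrogate` and `acquisition` are hypothetical callables standing in for the BNN training step and the entropy-based scoring introduced later, and the budget counts total oracle queries including the initial random subset.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_testing_loop(X, model_under_test, oracle, train_surrogate,
                        acquisition, budget, init_size=5):
    """Label a small random subset, then iteratively query the oracle for
    the point the acquisition function scores highest, retraining the
    surrogate after each query, until the labelling budget is depleted."""
    n = len(X)
    labelled_idx = rng.choice(n, size=init_size, replace=False)
    labels = {int(i): oracle(X[i]) for i in labelled_idx}
    surrogate = train_surrogate(X, labels, model_under_test)
    while len(labels) < budget:
        pool = [i for i in range(n) if i not in labels]   # unlabelled points
        scores = acquisition(surrogate, X, pool)
        pick = pool[int(np.argmax(scores))]               # most informative point
        labels[pick] = oracle(X[pick])                    # query the external oracle
        surrogate = train_surrogate(X, labels, model_under_test)  # retrain
    return surrogate, labels
```

The metrics of interest would then be estimated from the final surrogate's predictions on the still-unlabelled points together with the acquired labels.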
With this framework, one choice is to use a standard AL method to learn a surrogate model that accurately predicts the labels of all the data points in the test dataset; however, this choice is not optimal. To efficiently estimate the metrics of interest, the surrogate model does not need to accurately predict the labels of all the data points; it only needs to accurately predict the labels of those data points that contribute significantly to the accuracy of the metric estimates. For our active testing framework, we first propose a method to train the surrogate model, a Bayesian neural network (BNN), that provides high metric estimation accuracy from a limited number of labelled data points by incorporating information from the model-under-test. Second, we derive an entropy-based acquisition function that selects the data points whose labels, once acquired, maximally reduce the estimation uncertainty of the metric of interest. We then use this computed entropy to generalize our framework to work with multiple metrics. Finally, we demonstrate the efficacy of our proposed testing framework using various models-under-test and a wide range of metric sets on different datasets. In summary, our main contributions are: 1. ALT-MAS, a data-efficient testing framework that can accurately estimate the performance of a machine learning model; 2. A novel approach to train the BNN so as to accurately estimate the metrics of interest; 3. A novel sampling methodology to estimate the metrics of interest efficiently; and 4. A demonstration of the empirical effectiveness of our framework on various models-under-test, for a wide range of metrics, and on different datasets.
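To make the entropy-based idea concrete, a standard way to score uncertainty with a BNN is the entropy of its mean predictive distribution over stochastic forward passes. The sketch below is a generic illustration of this building block under that assumption, not the paper's exact acquisition rule (which additionally targets the metrics of interest):

```python
import numpy as np

def predictive_entropy(prob_samples):
    """Entropy of the BNN's mean predictive distribution per pool point.

    prob_samples: (S, N, C) array of class probabilities from S stochastic
    forward passes over N unlabelled points with C classes. The point with
    the highest entropy would be sent to the oracle next.
    """
    mean_probs = prob_samples.mean(axis=0)            # (N, C), averaged over passes
    eps = 1e-12                                       # avoid log(0)
    return -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
```

A point the BNN classifies confidently scores near zero, while a point with a near-uniform predictive distribution scores near log C and is prioritised for labelling.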

2.1. PROBLEM FORMULATION

Let us assume we are given a black-box model-under-test A that produces a prediction A(x) for an input x, with A(x) ∈ C = {1, ..., C}. Let us also assume we have access to (i) an unlabelled test dataset X = {x_i}_{i=1}^N, and (ii) an oracle that can provide the label y_x for each input x in X. Given a set of performance metrics {Q_k}_{k=1}^K, Q_k : R^N × R^N → R, the goal is to efficiently estimate the values of these metrics, {Q*_k}_{k=1}^K, when evaluating the model-under-test A on the test dataset X. That is, using the minimal number of oracle queries, we aim to estimate

Q*_k = Q_k(A_X, Y_X), k = 1, ..., K,

with A_X = {A(x)}_{x∈X} and Y_X = {y_x}_{x∈X}. In this work, we focus on classifiers because they are common supervised learning models and the target models of most machine learning testing papers (Zhang et al., 2019). It is also worth noting that, as we only have access to the outputs of the black-box classifier, the metrics {Q_k} must be computable using solely the classifier outputs A_X and the ground-truth labels Y_X of the test dataset X. Examples of Q_k include accuracy, error rate, per-class precision/recall, macro precision/recall, the F_β score, etc. This distinguishes them from metrics that require information about the classifier's internal structure, such as the log-loss metric.
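To make the restriction on {Q_k} concrete, the illustrative sketch below (not from the paper) computes two such metrics from nothing but the black-box predictions A_X and the oracle labels Y_X:

```python
import numpy as np

def accuracy(a_X, y_X):
    """Fraction of test points where the black-box prediction matches the label."""
    return float(np.mean(np.asarray(a_X) == np.asarray(y_X)))

def macro_recall(a_X, y_X, classes):
    """Mean per-class recall, averaged over the classes present in the labels."""
    a, y = np.asarray(a_X), np.asarray(y_X)
    recalls = [float(np.mean(a[y == c] == c)) for c in classes if np.any(y == c)]
    return float(np.mean(recalls))
```

Both functions take only A_X and Y_X as inputs, so they satisfy the black-box requirement; by contrast, log-loss would need the classifier's predicted probabilities, which are part of its internal output rather than its hard predictions.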

