WINDOW-BASED DISTRIBUTION SHIFT DETECTION FOR DEEP NEURAL NETWORKS

Abstract

To deploy and operate deep neural models in production, the quality of their predictions, which might be degraded benignly or manipulated maliciously by input distributional deviations, must be monitored and assessed. Specifically, we study the case of monitoring the healthy operation of a deep neural network (DNN) receiving a stream of data, with the aim of detecting input distributional deviations that potentially damage the quality of the network's predictions. Using selective prediction principles, we propose a distribution deviation detection method for DNNs. The proposed method is derived from a tight coverage generalization bound computed over a sample of instances drawn from the true underlying distribution. Based on this bound, our detector continuously monitors the operation of the network over a test window and fires off an alarm whenever a deviation is detected. This novel detection method consistently and significantly outperforms the state of the art on the CIFAR-10 and ImageNet datasets, thus establishing a new performance bar for this task, while being substantially more efficient in time and space complexity.

1. INTRODUCTION

A wide range of artificial intelligence applications and services rely on deep neural models because of their remarkable accuracy. When a trained model is deployed in production, its operation should be monitored for abnormal behavior, and a flag should be raised if any is detected. Corrective measures can be taken if the underlying cause of the abnormal behavior is identified. For example, simple distributional changes may only require retraining with fresh data, while more severe cases may require redesigning the model (e.g., when new classes emerge). In this paper we focus on distribution shift detection in the context of deep neural models and consider the following setting. A pretrained model f is given, and we presume it was trained with data sampled from some distribution P. In addition to the dataset used to train f, we are given a further sample of data from P, which is used to train a detector D (we refer to this as the detection-training dataset). While f is used in production to process a stream of emerging input data, we continually feed D with the most recent window W_k of k input elements. The detector also has access to the final layers of the model f and should be able to determine whether the data contained in W_k came from a distribution different from P. Window-based detection algorithms, such as the ones we consider here, have rarely been studied in the context of deep neural networks; to the best of our knowledge, window-based deep detection has only been considered by Rabanser et al. (2019). We emphasize that in this paper we are not considering the problem of identifying single-instance out-of-distribution or outlier instances (Liang et al., 2018; Hendrycks & Gimpel, 2017; Hendrycks et al., 2019; Golan & El-Yaniv, 2018; Ren et al., 2019; Nalisnick et al., 2019; Nado et al., 2021; Fort et al., 2021), but rather the information residing in a population of k instances.
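The monitoring setting above can be sketched as a sliding window of the model's most recent outputs, handed to a detection routine once the window is full. This is a minimal illustration of the interface, not the paper's implementation; the `detect` callable standing in for D is a hypothetical placeholder (e.g., a two-sample test on softmax outputs).

```python
from collections import deque

import numpy as np


class WindowMonitor:
    """Sketch of the setting: a detector D watches the most recent
    k outputs of a deployed model f. `detect` is any callable that
    maps a (k, ...) array of model outputs to True (alarm) / False."""

    def __init__(self, k, detect):
        self.window = deque(maxlen=k)  # keeps only the k newest outputs
        self.detect = detect

    def observe(self, model_output):
        """Feed one new model output; returns True if an alarm fires."""
        self.window.append(model_output)
        if len(self.window) == self.window.maxlen:
            return bool(self.detect(np.asarray(self.window)))
        return False  # window not yet full; no decision
```

In practice `model_output` would be a softmax or embedding vector from f's final layers; here it can be any array-like value the chosen `detect` understands.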
Single-instance methods are trivially applicable to a window. However, these methods are not designed to detect population-based changes (see the discussion in Section 2). We also note that this paper does not address the problem of characterizing the type of distribution shift, nor that of correcting it (by "redesigning" the model to make accurate predictions on the shifted distribution). The detection of distribution shifts is a fundamental topic in machine learning and statistics, and the standard approach is to perform a dimensionality reduction over both the detection-training (source) and test (target) samples, and then apply a two-sample statistical test over these reduced representations to detect a deviation. This is further discussed in Section 2. Distribution shift detection has been scarcely considered in the context of deep neural networks (DNNs). Deep models can benefit from the semantic representation created by the model itself, which provides a meaningful dimensionality reduction that is readily available at the last layers of the model. Using the embedding layer (or softmax) along with statistical two-sample tests was recently proposed by Lipton et al. (2018) and Rabanser et al. (2019), who termed solutions of this structure black-box shift detection (BBSD). Using both the univariate Kolmogorov-Smirnov (KS) test and the maximum mean discrepancy (MMD) method (see details below), Rabanser et al. (2019) achieve impressive detection results when using MNIST and CIFAR-10 as proxies for the distribution P. As we demonstrate here, the KS-BBSD method is also very effective over ImageNet when a stronger model is used (EfficientNet vs. ResNet-18). BBSD methods have the disadvantage of being computationally intensive due to the use of two-sample tests between the detection-training set (which can be, and preferably is, as large as possible) and the window W (a complexity analysis is provided in Table 1).
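As a rough illustration of a KS-BBSD-style baseline, the sketch below runs a univariate two-sample KS test on each softmax dimension with a Bonferroni correction across dimensions. The numpy-only KS statistic and the asymptotic critical-value formula are standard textbook forms, assumed here for self-containment; they are not the exact implementation of Rabanser et al. (2019).

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def bbsd_ks_detect(source_softmax, window_softmax, alpha=0.05):
    """Flag a shift if any softmax dimension deviates between the
    detection-training (source) sample and the test window, using
    Bonferroni-corrected KS tests with the asymptotic critical value
    c(a) * sqrt((n + m) / (n * m)), where c(a) = sqrt(-ln(a / 2) / 2)."""
    n, d = source_softmax.shape
    m = window_softmax.shape[0]
    a = alpha / d  # Bonferroni correction over d dimensions
    crit = np.sqrt(-0.5 * np.log(a / 2.0)) * np.sqrt((n + m) / (n * m))
    stats = [
        ks_statistic(source_softmax[:, j], window_softmax[:, j])
        for j in range(d)
    ]
    return bool(max(stats) > crit)  # True => shift detected
```

Note that every incoming window triggers d KS tests against the full source sample, which is the per-window cost the paper's method avoids.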
We propose a different approach based on selective prediction (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017), where a model quantifies its prediction uncertainty and abstains from predicting on uncertain instances. First, we develop a method for selective prediction with guaranteed coverage. This method identifies the best abstention threshold and coverage bound for a given pretrained classifier f, such that the resulting empirical coverage will not violate the bound with high probability (when abstention is determined using the threshold). The guaranteed coverage method is of independent interest, and it is analogous to selective prediction with guaranteed risk (Geifman & El-Yaniv, 2017). Because the empirical coverage of such a classifier is highly unlikely to violate the bound if the underlying distribution remains the same, a systematic violation indicates a shift in distribution. To be more specific, given a detection-training sample S_m, our coverage-based detection algorithm computes log_2(m) tight generalization coverage bounds, which are then used to detect a distribution shift in a window W of test data. Due to its aggressive reduction of S_m to O(log m) numbers, the proposed detection algorithm is extremely efficient in its computational requirements, unlike the baseline algorithms mentioned above, which follow the framework depicted in Figure 3 in Appendix 7.1. For example, consider the JFT-3B dataset (Zhai et al., 2021). Previous methods, which require processing this set for each incoming window, are infeasible at this scale, while our method summarizes it with only 32 scalars. In a comprehensive empirical study, we compared our coverage-based detection algorithm with the best-performing BBSD baselines, including the KS approach of Rabanser et al. (2019). All methods used the same underlying models (ResNet-18, ResNet-50 and EfficientNet) for a fair comparison. We simulated source distributions using both the CIFAR-10 and ImageNet databases.
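The core idea of coverage-based detection can be sketched in a few lines: calibrate an abstention threshold on the detection-training confidences, attach a high-probability lower bound on coverage, and fire an alarm when a window's empirical coverage falls below that bound. This is a simplified illustration under stated assumptions: the Hoeffding-style slack below is a loose stand-in for the paper's tighter generalization coverage bound, and `target_coverage` and `delta` are hypothetical parameter names.

```python
import numpy as np


def calibrate_detector(train_conf, target_coverage=0.9, delta=1e-3):
    """Pick a threshold so that ~target_coverage of the detection-training
    confidences exceed it, and a lower bound that the empirical coverage
    should satisfy with probability >= 1 - delta under no shift."""
    m = len(train_conf)
    theta = np.quantile(train_conf, 1.0 - target_coverage)
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * m))  # Hoeffding-style term
    return float(theta), float(target_coverage - slack)


def shift_alarm(window_conf, theta, lower_bound):
    """Fire an alarm when the window's empirical coverage (fraction of
    confidences at or above theta) systematically violates the bound."""
    coverage = float(np.mean(np.asarray(window_conf) >= theta))
    return coverage < lower_bound
```

Note that after calibration the detection-training sample can be discarded: each incoming window only requires comparing k confidence scores against a handful of stored scalars, which is the source of the method's efficiency advantage.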
Distribution shifts were produced using various methods, ranging from simple noise to adversarial examples. Based on these experiments, we can claim that our coverage-based detection method is significantly more powerful than the baselines across a wide range of test window sizes. To summarize, the contributions of this paper are: (1) a theoretically justified algorithm (Algorithm 1) that produces a coverage bound, which is of independent interest and allows for the creation of selective classifiers with guaranteed coverage; (2) a theoretically motivated "windowed" detection algorithm (Algorithm 2), which detects a distribution shift over a window; and (3) a comprehensive empirical study demonstrating significant improvements relative to existing baselines over a variety of datasets and architectures.

2. RELATED WORK

Distribution shift detection methods often comprise the following two steps: dimensionality reduction, and a two-sample test between the detection-training sample and the test sample. In most cases, these methods are "lazy" in the sense that for each test sample they make a detection decision based on a computation over the entire detection-training sample; their performance is sub-optimal if only a subset of that sample is used. Figure 3 in Appendix 7.1 illustrates this general framework. The use of dimensionality reduction is optional, but it can often improve performance by focusing on a less noisy representation of the data. Dimensionality reduction techniques include no reduction, principal component analysis (Wold et al., 1987), sparse random projection (Bingham & Mannila, 2001), autoencoders (Rumelhart et al., 1985; Pu et al., 2016), domain classifiers (Rabanser et al., 2019), and more. In this work we focus on black-box shift detection (BBSD) methods (Lipton et al., 2018), which rely on deep neural representations of the data generated by a pretrained model. The representation we extract from the model will typically utilize either the softmax outputs

