A LEARNING BASED HYPOTHESIS TEST FOR HARMFUL COVARIATE SHIFT

Abstract

The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distribution-level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.

1. INTRODUCTION

Machine learning models operate on the assumption, often an incorrect one, that they will be deployed on data distributed identically to what they were trained on. The violation of this assumption is known as distribution shift and can often result in significant degradation of performance [Bickel et al., 2009; Rabanser et al., 2019; Otles et al., 2021; Ovadia et al., 2019]. There are several cases where a mismatch between training and deployment data has very real consequences for human beings. In healthcare, machine learning models have been deployed for predicting the likelihood of sepsis. Yet, as Habib et al. [2021] show, such models can be miscalibrated for large groups of individuals, directly affecting the quality of care they experience. The deployment of classifiers in the criminal justice system [Hao, 2019], hiring and recruitment pipelines [Dastin, 2018] and self-driving cars [Smiley, 2022] has seen humans affected by the failures of learning models. The need for methods that quickly detect, characterize and respond to distribution shift is, therefore, a fundamental problem in trustworthy machine learning. We study a special case of distribution shift, commonly known as covariate shift, which considers shifts only in the distribution of input data P(X) while the relation between the inputs and outputs P(Y|X) remains fixed. In a standard deployment setting where ground truth labels are not available, covariate shift is the only type of distribution shift that can be identified.
For practitioners, regulatory agencies and individuals to have faith in deployed predictive models without the need for laborious manual audits, we need methods for the identification of covariate shift that are sample-efficient (identifying shifts from a small number of samples), informed (identifying shifts relevant to the domain and learning algorithm), model-agnostic (identifying shifts regardless of the functional class of the predictive model) and statistically sound (identifying true shifts while avoiding false positives with high confidence). We build on recent progress in understanding model performance under covariate shift using the PQ-learning framework [Goldwasser et al., 2020], a framework for selective classifiers that may either predict on or reject a given sample, and that provides strong performance guarantees on arbitrary test distributions. Our work uses and extends PQ-learning to develop a practical, model-based hypothesis test, named the Detectron, to identify potentially harmful covariate shift given any classification model already in deployment. Our work makes the following key contributions:
• We show how to construct an ensemble of classifiers that maximize out-of-domain disagreement while behaving consistently in the training domain. We propose the disagreement cross entropy for models learned via continuous gradient-based methods (e.g., neural networks), as well as a generalization for those learned via discrete optimization (e.g., random forests).
• We show that the rejection rate and the entropy of the learned ensemble can be used to define a model-aware hypothesis test for covariate shift, the Detectron, which in idealized settings can provably detect covariate shift.
• On high-dimensional image and tabular data, using both neural networks and gradient-boosted decision trees, our method outperforms state-of-the-art techniques for detecting covariate shift, particularly when given access to as few as ten test examples.
[Figure 1: Overview of the Detectron. Under the hypotheses H0: P = Q and Ha: P ≠ Q, an ensemble g(·) is constrained to agree with the base classifier f(x) on the training set while being trained to disagree on test data, yielding the baseline and test ensembles g_P(x) and g_Q(x).]

2. BACKGROUND AND RELATED WORK

Covariate Shift Detection. Covariate shift is the tendency for the distribution at test time p_test(x) to differ from that seen during training p_train(x) while the underlying prediction concept remains fixed, i.e., p_train(y|x) = p_test(y|x). Many methods for detecting shift apply dimensionality reduction followed by statistical hypothesis tests for distributional differences in the outputs (from a reference and a target set) [Rabanser et al., 2019]. Rabanser et al. show that using the softmax outputs of a pretrained classifier as low-dimensional representations for performing univariate KS tests, a method known as black box shift detection (BBSD) [Lipton et al., 2018], is effective at confidently identifying several synthetic covariate shifts in imaging data (e.g., crops, rotations) given approximately 200 i.i.d. samples. However, applying statistical tests to non-invertible representations of data can never be guaranteed to capture arbitrary covariate shifts, as there may always exist multiple distributions that collapse to the same test statistic [Zhang et al., 2021]. Kifer et al. [2004] and Ben-David et al. [2006] introduce some of the earliest learning-theoretic approaches for identifying and correcting for covariate shift based on discriminative learning with finite samples. More recent approaches for covariate shift detection, including classifier two-sample tests [Lopez-Paz and Oquab, 2017], deep kernel MMD [Liu et al., 2020] and H-Divergence [Zhao et al., 2022], rely on analyzing the outputs of unsupervised learning models (see Appendix subsection E.3 for more details). In our work we take a transductive learning approach and construct a method that directly uses the structure of a supervised classification problem to improve the statistical power for detecting shifts. Out of Distribution Detection.
Out-of-distribution (OOD) detection focuses on identifying when a specific data point x′ admits low likelihood under the original training distribution (p_train(x′) ≈ 0), a useful tool to have at inference time. Ren et al. [2019] and Morningstar et al. [2021] represent a broad class of work that uses density estimation to pose the identification of covariate shift as anomaly detection. Others, including ODIN [Liang et al., 2018], Deep Mahalanobis Detectors [Lee et al., 2018] and Gram Matrices [Sastry and Oore, 2020], directly use the predictive model (e.g., information from the intermediate representations of neural networks). The majority of modern methods in this space have been designed exclusively for deep neural networks, an uncommon modelling choice particularly for tabular data [Borisov et al., 2021]. Related to OOD detection is the task of estimating sources of uncertainty in model predictions [Lakshminarayanan et al., 2017; Ovadia et al., 2019]. Naturally, uncertainty should be large when samples are OOD; however, Ovadia et al. [2019] perform a large-scale empirical comparison of uncertainty estimation methods and find that while deep ensembles generally provide the best results, the quality of uncertainty estimates, regardless of method, consistently degrades with increasing covariate shift. Selective Classification and PQ Learning. Selective classification concerns building classifiers that may either predict on or reject test samples [Geifman and El-Yaniv, 2019]. Recent work by Goldwasser et al. [2020] develops a formal framework known as PQ learning which extends probably approximately correct (PAC) learning [Haussler, 1990] to arbitrary test distributions by allowing for selective classification.
While PAC learning concerns the development of a classifier with a bounded finite-sample error rate on its training distribution, PQ learning seeks a selective classifier with jointly bounded finite-sample error and rejection rates on arbitrary test distributions. The Rejectron algorithm proposed therein builds an ensemble of models that produce different outputs relative to a perfect baseline on a set of unlabeled test samples. We provide a summary of the original Rejectron algorithm in the supplementary material (see Appendix A). PQ learning represents a major theoretical leap for learning guarantees under covariate shift; however, the majority of the underlying ideas have not been implemented or tested experimentally on real-world data. We show how to build a PQ learner by generalizing the Rejectron algorithm, overcoming several limitations and assumptions made by the original work, including extending beyond simple binary classification to general multiclass and multilabel tasks and reducing the number of samples required for learning at each iteration.

3. DETECTRON

Problem Setup. Let f : X → Y be a classification model from a function class F that maps from the space of covariates X to a discrete set of classes Y = {1, . . . , N}. We assume f was trained on a dataset of labeled samples P = {(x_i, y_i)}_{i=1}^n where each x_i is drawn i.i.d. from a distribution P over X. In deployment, f is then made to predict on new unlabeled samples Q = {x_i}_{i=1}^m from a distribution Q over X. Our goal is to determine whether f may be trusted to do so accurately. The problem we address is how to automatically detect, from only a finite set of samples Q, whether the new covariate distribution Q has shifted from P in such a way that f can no longer be assumed to generalize; we refer to this type of dataset shift as harmful covariate shift.
Harmful Covariate Shift. A shift in the data distribution is not always harmful. In many practical problems, a practitioner may use domain knowledge to embed invariances with the explicit goal of ensuring the predictive performance of a classifier does not, by construction, change under certain shifts. This may be done directly, via translation invariance in convolutional neural networks or permutation invariance in DeepSets [Zaheer et al., 2017], or indirectly, via data augmentation or domain adaptation. Such practical heuristics can lead to models generalizing to a broader range of distributions than can be characterized by the training set alone. We refer to such an induced generalization set as R. Although R is difficult to characterize and will in general depend on the model architecture, learning algorithm and training dataset, we seek a practical method for detecting shift that is explicitly tied to R. We present a more formal definition of harmful covariate shift, as well as a connection to domain identification and the A-distance [Kifer et al., 2004], in Appendix H. Our approach is based both on PQ learning and intuition from learning theory.
If there exists a set of classifiers that share the same generalization set R but behave inconsistently on samples from a distribution Q, then Q must not be a member of R. Our strategy is to create an ensemble of constrained disagreement classifiers: classifiers constrained to predict consistently (i.e., predict the same as f) on R but as differently as possible on Q. If Q is within R, then such an ensemble will fail to predict differently. When we can find an ensemble that exhibits inconsistent behaviour on Q, there must be covariate shift that lies explicitly outside R. To make the idea of constrained disagreement classifiers tangible, we propose a simple definition which we translate into a learning algorithm in the following sections.
Constrained Disagreement Classifier (CDC). A constrained disagreement classifier g_(f,P,Q), or simply g when f, P and Q are clear from context, is a classifier with the following properties:
1. g belongs to the same model class as f ∈ F and is trained with the same algorithm on P;
2. g achieves similar performance to f on unseen samples drawn i.i.d. from P;
3. g disagrees maximally with f on elements of the dataset Q while not violating properties 1 and 2.
Our definition of a CDC aims to explicitly capture the concept of a classifier that learns the same generalization region as f while behaving as inconsistently as possible on Q.
Limitations of PQ Learning. As our work builds on PQ learning, we provide a summary of the original framework and clearly state the distinctions in our methodology. In PQ learning we seek a selective classifier h that achieves a bounded trade-off between its in-distribution rejection rate rej_h(x) and its out-of-distribution error err_h(x) with respect to a ground truth decision function d. Formally, this trade-off is defined using the following learning-theoretic bound (an extended description can be found in Appendix A.1).
PQ learning [Goldwasser et al., 2020]. Learner L (ϵ, δ, n)-PQ-learns F if for any distributions P, Q over X and any ground truth function d ∈ F, its output h := L(P, d(P), Q) satisfies

Pr_{x∼P^n, x̃∼Q^n} [ rej_h(x) + err_h(x̃) ≤ ϵ ] ≥ 1 − δ.   (1)

Learning to Disagree with the Disagreement Cross Entropy. To train a classifier to disagree in the binary setting, it suffices to flip the labels. In the multi-class setting, however, it is unclear what a good objective function is. We formulate an explicit loss function that can be minimized via gradient descent to learn a CDC. For classification problems, let ŷ := g(x_i) be the predictive distribution over N classes with its c-th component denoted ŷ_c, let f(x_i) ∈ {1, . . . , N} be the label predicted by f, and let 1(·) be a binary indicator. We define the disagreement cross entropy (DCE) l as:

l(ŷ, f(x_i)) = (1 / (1 − N)) Σ_{c=1}^{N} 1_{f(x_i) ≠ c} log(ŷ_c).   (2)

l corresponds to taking the cross entropy of ŷ with the uniform distribution over all classes except f(x_i). Since the primary criterion is that g(x_i) disagrees with f(x_i), l is designed to minimize the probability that g predicts the output of f while maximizing its overall entropy. Our definition of l is stable to optimize, has a bounded global minimum, and can be shown to have desirable properties for disagreement learning; see Appendix D for more details. Our goal is to agree on P and disagree on Q. Consequently, we learn with the loss in Equation 3, where ℓ denotes the standard cross entropy loss, l is the disagreement cross entropy, and λ is a scalar parameter that controls the trade-off between agreement and disagreement:

L_CDC(P, Q) = (1 / |P ∪ Q|) [ Σ_{(x_i, y_i) ∈ P} ℓ(g(x_i), y_i) + λ Σ_{x_i ∈ Q} l(g(x_i), f(x_i)) ].   (3)

When learning CDCs in practice, L_CDC should be combined with any additional regularization and data augmentation used in the original training process of f to ensure that we retain the true generalization region of f.
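Equations 2 and 3 can be sketched in a few lines of numpy. The following is an illustrative re-implementation, not the authors' code; the function names are our own:

```python
import numpy as np

def disagreement_cross_entropy(y_hat, f_label):
    # Eq. 2: cross entropy of y_hat against the uniform distribution over
    # all classes except the one predicted by f; dividing by (1 - N) flips
    # the sign of the summed log-probabilities and normalizes over the
    # N - 1 off-target classes.
    n_classes = len(y_hat)
    off_target = [c for c in range(n_classes) if c != f_label]
    return np.sum(np.log(y_hat[off_target])) / (1 - n_classes)

def cdc_loss(p_probs, p_labels, q_probs, q_f_labels, lam):
    # Eq. 3: standard cross entropy on labeled samples from P plus a
    # lambda-weighted disagreement cross entropy on unlabeled samples
    # from Q, averaged over all samples.
    agree = sum(-np.log(probs[y]) for probs, y in zip(p_probs, p_labels))
    disagree = sum(disagreement_cross_entropy(probs, fy)
                   for probs, fy in zip(q_probs, q_f_labels))
    return (agree + lam * disagree) / (len(p_probs) + len(q_probs))
```

Note that the DCE attains its bounded global minimum, log(N − 1), when g spreads its mass uniformly over the classes other than f's prediction, which is exactly the high-entropy disagreement behaviour the loss is meant to encourage.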
Furthermore, training and validation metrics must be closely monitored on unseen samples from P to ensure that g achieves similar generalization performance on P.
Choosing λ. In the original formulation of Rejectron, selective classifiers are trained on a dataset consisting of P replicated |P| times plus Q. Calling an ERM oracle on this data ensures that a misclassification on P is significantly more costly than one on Q, but requires Ω(|P|²) samples, an impractical number for large datasets. We show that we can instead choose the scalar parameter λ in Equation 3 to make learning P the primary objective; only when it cannot be improved do we allow g to learn how to disagree on Q. The reasoning is a simple counting argument. Suppose agreeing with each sample in P incurs a reward of 1 and disagreeing with each sample in Q a reward of λ. To make agreement on P the primary objective, we set λ such that the extra reward obtained by going from zero to all disagreements on Q is less than that achieved by only one extra agreement on P; this gives λ|Q| < 1. Practically, we choose λ = 1/(|Q| + 1) and find that no tuning is required. Some predictive models require learning with non-differentiable loss functions and/or discrete optimization (e.g., random forests). In such cases we fall back to a general disagreement loss formulated by duplicating each sample in Q (N − 1) times, giving each copy a unique label that is not the target and a weight of 1/(N − 1). In the continuous case, this corresponds exactly to Equation 2. To learn richer disagreement rules, we create an ensemble of CDCs where the k-th model is trained only to disagree on the subset of Q that has yet to be disagreed on by models 1 through k − 1. The final disagreement rate ϕ_Q is the fraction of unlabelled samples where any CDC provides an alternate decision from f. In what follows we use this rate to characterize shift.
From Constrained Disagreement to Detecting Shift with Hypothesis Tests.
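As a concrete illustration of the λ counting argument and the sample-duplication trick above, consider the following sketch (our own helper names, not the paper's implementation; the weighted copies could be passed to a discrete learner such as XGBoost through its sample-weight mechanism):

```python
def lambda_weight(n_q):
    # lambda * |Q| < 1 ensures one additional agreement on P always
    # outweighs flipping every prediction on Q; the paper picks
    # lambda = 1 / (|Q| + 1), which satisfies this strictly.
    return 1.0 / (n_q + 1)

def duplicate_for_disagreement(x_q, f_labels, n_classes):
    # Generalized disagreement loss for non-differentiable learners:
    # replicate each unlabeled sample (N - 1) times, once per label other
    # than f's prediction, each copy carrying weight 1 / (N - 1).
    xs, ys, ws = [], [], []
    for x, fy in zip(x_q, f_labels):
        for c in range(n_classes):
            if c != fy:
                xs.append(x)
                ys.append(c)
                ws.append(1.0 / (n_classes - 1))
    return xs, ys, ws
```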
A natural way to apply the concept of constrained disagreement to the identification of covariate shift is to partition Q into two sets, using the first to train a CDC ensemble and the second to compute an unbiased estimate of its held-out disagreement rate ϕ_Q. We would then statistically compare this disagreement rate, using a 2 × 2 exact hypothesis test, against a baseline estimate of the disagreement rate on P. The following shows that this yields a provably correct method to detect shift.
Theorem 1 (Disagreement implies covariate shift). Let f be a classifier trained on a dataset P consisting of samples drawn i.i.d. from P and their corresponding labels. Let g be a classifier that is observed to agree (classify identically) with f on P and disagree on a dataset Q drawn from Q. If the rate at which g disagrees with f on n unseen samples from Q is greater than that on n unseen samples from P with probability greater than p⋆ := (1/2)(1 − 4^{−n} C(2n, n)), where C(2n, n) is the central binomial coefficient, there must be covariate shift.
Sketch of Proof. We show that under the null hypothesis where P = Q, the tightest upper bound on the probability that g is more likely to disagree on Q than on P is p⋆. The contrapositive then states that if we deem the probability to be greater than p⋆, there must be covariate shift. This result motivates a hypothesis testing approach to determine how probable it is that g is truly more likely to disagree on Q given only a finite set of observations. The full proof can be found in Appendix C.
Our theory, while simple, has a limitation that prevents its direct application. Any approach that requires unseen samples from Q is ill-suited to the low-data regime, as it requires splitting Q, leaving an even smaller set for computing the disagreement rate. Estimates from small samples have high variance and ultimately low statistical power. Since our objective is to detect covariate shift from as few test samples as possible, splitting Q is not a good option.
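The threshold p⋆ in Theorem 1 has a clean interpretation: it is the probability that one of two exchangeable disagreement counts strictly exceeds the other. A small sanity check, using only the standard library (our own sketch):

```python
from math import comb

def p_star(n):
    # Theorem 1 threshold: under H0 (P = Q) the disagreement counts on the
    # two unseen samples are exchangeable, so P(count_Q > count_P) equals
    # (1 - P(tie)) / 2; for two iid Binomial(n, 1/2) counts the tie
    # probability is C(2n, n) / 4^n, giving p* = (1 - 4^-n * C(2n, n)) / 2.
    return 0.5 * (1 - comb(2 * n, n) / 4 ** n)
```

For example, p⋆ = 0.25 at n = 1 and increases toward 1/2 as n grows, so larger unseen samples demand correspondingly stronger evidence of asymmetry before shift is declared.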
To tackle this issue practically, we take a transductive approach based on intuition from learning theory: training models to disagree on samples from Q while generalizing to R is a far easier task when Q is not in R. We can therefore use the relative increase in disagreement between CDCs on Q and P to capture a quantity that is nearly as informative as the unbiased statistic, without reducing the samples from Q that we can use.
The Detectron Test. Our proposed method is to train two CDC ensembles, g_Q and g_P. g_Q is trained to disagree on all of Q, and g_P is a baseline trained to disagree on an unseen set P⋆ drawn from P. Once trained, we compute the fractions of samples ϕ_Q and ϕ_P that g_Q and g_P learn to disagree on in their respective sets. Under the null hypothesis, where Q belongs to the generalization region of f and P, E[ϕ_Q] ≤ E[ϕ_P], while harmful shift is expressed as the one-sided alternative H_a : E[ϕ_Q] > E[ϕ_P] (i.e., it is easier to learn to reject on Q than on P). We refer to this test as Detectron (Disagreement). To compute the test result at a significance level α, we first estimate the null distribution of ϕ_P for a fixed sample size n by training the Detectron for K calibration rounds with different random seeds and sets P⋆. The test result is significant if the observed disagreement rate ϕ_Q is greater than the (1 − α) quantile of the null distribution. For more information on the testing procedure see Algorithm 1 below and a detailed description in Appendix B. We consider an additional variant, Detectron (Entropy), which computes the prediction entropy of each sample under the CDC ensemble instead of relying solely on disagreement rates. The CDC entropy is computed from the mean probabilities over the N classes of the base classifier f and the set of k CDCs g_1, . . . , g_k.
CDC_entropy(x) = − Σ_{c=1}^{N} p̄_c log(p̄_c),  where  p̄ := (1 / (k + 1)) ( f(x) + Σ_{i=1}^{k} g_i(x) ).

We use a KS test to compute a p-value for covariate shift directly on the entropy distributions computed for Q and P⋆, and guarantee significance using the same strategy as above. The intuition for Detectron (Entropy) draws from the fact that when CDCs satisfy their objective (i.e., in the case of harmful shift) they learn to predict with high entropy on Q and low entropy on P⋆, resulting in a natural way to distinguish between distributions.

Algorithm 1 (core loop):
    while n > 0 and iterations ≤ ℵ do
        g ← ConstrainedDisagreement(L, P_train, P_val, Q, f)
        Q ← {x | x ∈ Q and f(x) = g(x)}
        ϕ_Q ← 1 − |Q| / N
    end
    return ϕ_Q > [(1 − α) quantile of Φ_P]
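The entropy statistic above can likewise be sketched in numpy. This is our own illustrative version; the KS statistic is written out by hand so the sketch stays self-contained, though in practice one would use a library routine such as scipy.stats.ks_2samp to obtain the p-value:

```python
import numpy as np

def cdc_entropy(prob_stack):
    # prob_stack: (k + 1, N) array of class probabilities for one input x,
    # stacking the base classifier f and the k CDCs; returns the entropy
    # of their mean predictive distribution p_bar.
    p_bar = prob_stack.mean(axis=0)
    return float(-np.sum(p_bar * np.log(p_bar + 1e-12)))

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    # empirical CDFs of the entropy values computed on Q and on P*.
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

Under harmful shift the entropies on Q concentrate near log N while those on P⋆ stay low, which drives the KS statistic toward 1.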

4. EMPIRICAL EVALUATION

Our experiments are carried out on natural distribution shifts across multiple domains, modalities, and model types. We use the CIFAR-10.1 dataset [Recht et al., 2019], where shift comes from subtle changes in the dataset creation process, the Camelyon17 dataset [Veeling et al., 2018] for metastasis detection in histopathological slides from multiple hospitals, and the UCI heart disease dataset [Janosi et al., 1988], which contains tabular features collected across international health systems along with indicators of heart disease. We present unseen source and target domain performance of base models trained on source data in Appendix Table 4, which shows significant performance drops as an indicator of the harmfulness of these shifts. See Appendix F for more details on datasets.
Learning Constrained Disagreement. We begin by training ensembles of 10 CDCs using the disagreement cross entropy (DCE) with CIFAR-10 as P and CIFAR-10.1 as Q for 100 random runs at a sample size of 50 (see Appendix E for training details). The results in Figure 2 empirically validate minimizing the DCE as a CDC learning objective. The first observation is that when an unseen set is drawn from a shifted distribution Q, the empirical disagreement rate ϕ_Q grows significantly larger than the baseline disagreement rate ϕ_P. Next, we see that CDCs preserve accuracy on data from the training distribution. Finally, as the ensemble size increases (and disagreed-upon points are removed), we see that the accuracy of the classifier increases. This indicates that the points disagreed upon early in the algorithm are those that would have been misclassified. We see that for all ensemble sizes there is lower disagreement on unshifted data (CIFAR-10) than on shifted data (CIFAR-10.1). (Center) Constrained disagreement does not compromise in-distribution performance.
(Right) As the ensemble grows, the selective classification accuracy, computed on the set of test examples that all models agree on, increases both on in-distribution and out-of-distribution data. Confidence intervals are computed as ± one standard deviation across experiments.
Shift Detection Setup. We evaluate the Detectron in a standard two-sample testing scenario similar to prior work [Zhao et al., 2022]. Given two datasets P = {(x_i, y_i)}_{i=1}^n (x_i drawn from P) and Q = {x_i}_{i=1}^m (x_i drawn from Q) and a classifier f, we seek to rule out the null hypothesis (P = Q) at the 5% significance level. To guarantee fixed significance we employ a permutation test, first sampling from the distribution of test statistics derived by the Detectron when the null hypothesis P = Q holds (i.e., Q is drawn from P). We then compute a threshold over the empirical test statistic distribution that sets the false positive rate to 5% (see Appendix B, Figure 5). This step can be performed in advance of deployment as it only requires access to P. To mimic deployment settings where we wish to identify covariate shift quickly, we assume access to far more samples from P than from Q. For each dataset, we begin by training a base classifier on the unshifted dataset. We evaluate the detection of covariate shift on 100 randomly selected test sets of n = 10, 20 and 50 samples from Q. In all cases we train a maximum ensemble size of 5 (parameter ℵ in Algorithm 1). To prevent CDCs from overfitting in the case of small test set sizes, we perform early stopping if in-distribution validation performance drops by more than 5% from the measured performance of the base classifier. Hyperparameters and training details for all models can be found in Appendix E.
Evaluation. We report the True Positive Rate at 5% Significance Level (TPR@5) aggregated over 100 randomly selected sets Q.
This signifies how often our method correctly identifies covariate shift (P ≠ Q) while incorrectly identifying shift only 5% of the time. This is also referred to as the statistical power of a test at a significance level (α) of 5%.
Baselines. We compare the Detectron against several methods for OOD detection, uncertainty estimation and covariate shift detection: Deep Ensembles [Ovadia et al., 2019] using both (1) disagreement and (2) entropy scoring methods as a direct ablation to the CDC approach, (3) Black Box Shift Detection (BBSD) [Lipton et al., 2018], (4) Relative Mahalanobis Distance (RMD) [Ren et al., 2021], (5) Classifier Two-Sample Tests (CTST) [Lopez-Paz and Oquab, 2017], (6) deep kernel MMD (MMD-D) [Liu et al., 2020], and (7) H-Divergence (H-Div) [Zhao et al., 2022]. For more information on baselines see Appendix subsection E.3.
Shift Detection Experiments. We begin with an analysis of the performance of the Detectron on the UCI Heart Disease dataset. Using sample sizes ranging from 10 to 100, we compute the TPR@5 (based on 100 random seeds) and plot the results in Figure 3. We use gradient-boosted trees (XGBoost [Chen and Guestrin, 2016]) for the Detectron and CTST methods, while the remaining baselines use a 2-layer MLP that achieves similar test performance. We report the mean and standard error of TPR@5 for sample sizes of 10, 20 and 50 on all datasets in Table 1.
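The TPR@5 metric can be computed as follows (our own sketch of the evaluation protocol, not the authors' code): calibrate a rejection threshold on test statistics simulated under the null, then count how often shifted trials clear it.

```python
import numpy as np

def tpr_at_5(stats_null, stats_shift, alpha=0.05):
    # Statistical power at significance alpha: the (1 - alpha) quantile of
    # the null statistics (from trials where Q is drawn from P) fixes the
    # false positive rate at alpha; TPR@5 is then the fraction of shifted
    # trials whose statistic exceeds that threshold.
    threshold = np.quantile(stats_null, 1 - alpha)
    return float(np.mean(np.asarray(stats_shift) > threshold))
```

The same calibrated-quantile rule is what the Detectron applies at deployment time, with the observed disagreement rate (or entropy KS statistic) playing the role of the shifted trial.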

4.1. DISCUSSION

Overall Performance. We observe in the bottom rows of Table 1 that Detectron methods outperform all baselines across all three tasks. This confirms our intuition that designing distribution tests based specifically on the available data and the outputs of learning algorithms is a promising avenue for improving existing methods in the high-dimensional/low-data regime. Sample Efficiency. For the more pronounced shifts (Camelyon and UCI), Table 1 shows the most significant improvements over baselines in the lowest-sample regime (10 data points). The fine-grained result in Figure 3 shows that CTST catches up to the Detectron at 40 samples, while deep ensembles, BBSD, and Mahalanobis catch up at 100.

Disagreement vs Entropy.

For the experiments on imaging datasets with deep neural networks, Detectron (Disagreement) often performs nearly as well as Detectron (Entropy), while Detectron (Entropy) is strictly superior on the UCI dataset. While we recommend entropy as the method that maximizes test power, disagreement is a more interpretable statistic as it correlates well with the proportion of misclassified samples (see Figure 2, right). Comparison to baselines. Amongst the baselines, there is no clear best method. On average, ensemble entropy is superior on CIFAR, MMD-D on Camelyon, and CTST on UCI. Our method may be thought of as a combination of ensembles, CTST, and H-Divergence: like ensembles, we leverage the variation in outputs between a set of classifiers; like CTST, we learn in a domain-adversarial setting; and like H-Divergence, we compute a test statistic based on data that a model was trained on. Lastly, while MMD-D and H-Divergence were shown to be the previous state of the art, their performance was validated only on larger sample sizes (≥ 200). On Tabular Data. The Detectron shows promise for deployment on tabular datasets (bottom right of Table 1 and Figure 3), where (1) the computational cost of training models is low, (2) the model-agnostic nature of the Detectron is beneficial, as random forests often outperform neural networks on tabular data [Borisov et al., 2021], and (3) based on our discussions with medical professionals, the ability to detect covariate shift from small test sets is of particular interest in the healthcare domain, where population shift is a constant burden for maintaining the reliability of deployed models. On Computational Cost. Our method is more computationally expensive than some existing methods for detecting shift, such as BBSD and Mahalanobis scores, but is of similar complexity to approaches such as Ensembles, MMD-D and H-Divergence, which may require training multiple deep models.
However, as the Detectron leverages a pretrained model already in deployment, we find in practice that only a small number of training rounds are required to create each CDC. Furthermore, looking at the runtime behavior in Figure 4, we see that while allowing for more computation time increases the fidelity of the Detectron, only a small number of training batches may be required to achieve a desirable level of statistical significance. In scenarios where the deployed classifier is deemed high-risk (e.g., healthcare, the justice system, education), we believe the additional computational expense is justified for an accurate assessment of whether the classifier needs updating. Having established its utility, accelerating the Detectron and building a deeper understanding of its runtime performance are fertile ground for future work.

5. CONCLUSION, LIMITATIONS AND FUTURE WORK

Our work presents a practical method capable of detecting covariate shift given a pre-existing classifier. On both neural networks and random forests, we showcase the efficacy of our method for detecting covariate shift using a small number of unlabelled examples across several real-world datasets. We remark on several characteristics that represent potential directions for future work. Beyond Classification: Our work focuses on the case of classification; however, we believe there is a viable extension to regression models, where constrained predictors are explicitly learned to maximize test error according to an existing metric such as mean squared error. We leave this exploration for future work. Finally, we wish to highlight that while auditing systems such as the Detectron show promise to ease concerns when using learning systems in high-risk domains, practitioners interfacing with these systems should not place blind trust in their outputs.

ETHICS STATEMENT

The speed of adoption of ML in high-risk scenarios raises a critical need for methods that ensure trustworthiness and reliability. However, over-reliance on such methods brings about critical ethical considerations. As we have seen, the Detectron is highly sensitive at picking out discriminative features of data distributions, and, as such, its usage may prevent practitioners from deploying models in new environments. As a result, the individuals in those environments may become subject to unfair treatment. For instance, if the Detectron determines a model trained on hospital A to be safe to deploy in hospitals B and C, but not in D, the individuals in population D may experience a lower level of care. In a real example, the Detectron, when tested on a model trained on a subset of light-skinned celebrities (CelebA dataset; Liu et al. [2015]), quickly raises the alarm when given images of those who are not light-skinned. While the Detectron can help mitigate potential disasters encountered by deploying models in hazardous domains, it should not be an excuse for practitioners to avoid collecting richer and more diverse datasets as the primary strategy for ensuring model reliability.

A LEARNING ALGORITHMS

This section presents further details on the primary learning algorithms used and referred to in our work.

A.1 REJECTRON

We provide a summary of the original Rejectron algorithm for PQ learning [Goldwasser et al., 2020] as it is the primary motivation for our work. Rejectron (algorithm 2) takes as input a labeled training set of n samples x (iid over P), an unlabeled test set of n samples x̃ (iid over Q), an error ϵ and a weight Λ. The output is a selective classifier that predicts according to a base classifier h if the input x is inside some set S and otherwise rejects (abstains from predicting):

$$h|_S(x) := \begin{cases} h(x) & x \in S \\ \text{reject} & x \notin S \end{cases}$$

The error and rejection rate of a selective classifier $h|_S$ are defined with respect to a distribution P as:

$$\mathrm{rej}_h(S) := \Pr_{x\sim P}[x \notin S], \qquad \mathrm{err}_h(S) := \Pr_{x\sim P}[h(x) \neq y \wedge x \in S]$$

The empirical rejection and error rates for a set of samples $x = \{x_i\}_{i=1}^n$ and corresponding labels $y = \{y_i\}_{i=1}^n$ are similarly defined as:

$$\mathrm{rej}_h(x) := \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{x_i \notin S}, \qquad \mathrm{err}_h(x) := \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{h(x_i)\neq y_i}\cdot \mathbb{1}_{x_i \in S}$$

Under the selective classification framework, Goldwasser et al. extend the conventional concept of PAC learning [Haussler, 1990] to test samples drawn from an unknown distribution Q. PQ learning [Goldwasser et al., 2020]: Learner L $(\epsilon,\delta,n)$-PQ-learns F for $0 \le \epsilon \le 1$, $0 \le \delta \le 1$ and $n \in \mathbb{Z}^+$ if for any distributions P, Q over X and any ground truth function $d \in F$, its output $h := L(P, d(P), Q)$ satisfies

$$\Pr_{x\sim P^n,\, \tilde x\sim Q^n}\left[\mathrm{rej}_h(x) + \mathrm{err}_h(\tilde x) \le \epsilon\right] \ge 1 - \delta$$

Under several assumptions and a special value ϵ⋆, this selective classifier is guaranteed with high probability to have error less than 2ϵ⋆ on x̃ and a rejection rate below ϵ⋆ on x (see Theorem 5.7 in Goldwasser et al.).

Algorithm 2: Rejectron [Goldwasser et al., 2020]
Input: train $x \in X^n$, labels $y \in Y^n$, test $\tilde x \in X^n$, error $\epsilon \in [0,1]$, weight $\Lambda = n + 1$
Output: selective classifier $h|_S$
$h \leftarrow \mathrm{ERM}(x, y)$
for $t = 1, 2, 3, \ldots$ do
1. $S_t := \{x \in X : h(x) = c_1(x) = \ldots = c_{t-1}(x)\}$
2. Choose $c_t \in C$ to maximize $s_t(c) := \mathrm{err}_{\tilde x}(h|_{S_t}, c) - \Lambda \cdot \mathrm{err}_x(h, c)$ over $c \in C$
3. If $s_t(c_t) \le \epsilon$, then stop and return $h|_{S_t}$

Rejectron starts by querying an empirical risk minimization (ERM) oracle, which uses a 0-1 risk over a concept class C, for a model h that perfectly learns the training dataset. A primary assumption, both for Rejectron to output a perfect model and for it to eventually find a selective classifier that meets the ϵ⋆ bound, is that the true decision function (i.e., the function that creates the training labels) is also a member of C; the authors refer to this setting as realizable. On the first iteration of the algorithm, Rejectron finds another model c₁ ∈ C that jointly maximizes disagreement with h on the test set x̃ while minimizing it on the training set x. The authors show that this optimization problem can be solved efficiently using a single ERM query on a dataset of $n^2 + n$ samples (see Lemma 5.1 in Goldwasser et al.). In every subsequent step t > 1, a set $S_t$ is created on which all models h through $c_{t-1}$ agree. Another model $c_t \in C$ is found that maximizes the same objective as above, but only on the intersection of x̃ and $S_t$. Upon termination, the selective classifier $h|_{S_t}$ is output.
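The selective behavior of $h|_S$ described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation; the function name and the use of plain callables for classifiers are our own conventions. An input is in $S$ exactly when the base classifier and every challenger $c_t$ agree on it.

```python
def make_selective_classifier(h, challengers):
    """Selective classifier h|_S: predict h(x) when every challenger c_t
    agrees with h on x (i.e. x is in S), otherwise reject."""
    def h_S(x):
        y = h(x)
        if all(c(x) == y for c in challengers):
            return y
        return "reject"
    return h_S
```

For example, with a base classifier `h = lambda x: x % 2` and a single challenger that flips its answer for inputs ≥ 5, the selective classifier predicts normally on small inputs and rejects the rest.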

A.2 CONSTRAINED DISAGREEMENT CLASSIFIERS

We formally present the algorithm for creating a constrained disagreement classifier (section 3), a fundamental tool used to detect distribution shift in our work. The main inputs are a labeled training set P, an unlabeled test set Q, and a classifier f trained on P using a learning algorithm L. Three other hyperparameters include a metric M to evaluate the performance of a classifier on a labeled dataset (e.g., accuracy), a tolerance ϵ which controls how much the metric may drop during disagreement steps, and a maximum number of epochs.

Q ← {(x, f(x)) | x ∈ Q}  // infer pseudo labels on Q using f
PQ ← Batched({(x, y) | (x, y) ∈ P ∪ Q})  // create a batched dataloader using P and Q
g ← f  // initialize g with f
m₀ ← M(f, P_val)  // baseline validation performance of f
for each epoch do
    for each batch (x_P, y_P), (x_Q, y_Q) in PQ do
        g ← Update(g, L, (x_P, y_P), (x_Q, y_Q))
    end
    if M(g, P_val) < m₀ − ϵ then stop
end
return g

Our algorithm is a practical generalization of the inner step in Rejectron: (1) We require tight performance monitoring on a validation set to prevent overfitting, as we commonly observed that training models to disagree on small in-distribution test sets causes a drop in their in-distribution performance on unseen data after many training epochs. (2) We allow for arbitrary-sized train and test sets. (3) We provide a methodology (section 3, Learning to Disagree with the Disagreement Cross Entropy) for updating arbitrary classification models to disagree on test data while agreeing on in-distribution data, which leverages the generalization structure of their original learning algorithm while requiring quadratically fewer training samples than Rejectron. Further implementation details for step (3) can be found below in Appendix D.
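The training loop for a CDC can be sketched as plain Python, with `learn_step`, `metric`, and the dataset containers as hypothetical stand-ins for the learning algorithm L, the metric M, and the dataloaders; this is a structural sketch of the loop, not the paper's implementation.

```python
import copy

def train_cdc(f, learn_step, metric, P_train, P_val, Q, epsilon=0.05, max_epochs=10):
    """Sketch of constrained disagreement training.

    f          -- base classifier with a .predict(x) method; g is initialised from f
    learn_step -- one epoch of updates pushing g to agree with labels on P_train
                  and to disagree with f's pseudo-labels on Q
    metric     -- validation metric M(g, P_val), e.g. accuracy
    epsilon    -- tolerated drop in the validation metric before stopping
    """
    Q_pseudo = [(x, f.predict(x)) for x in Q]   # infer pseudo labels on Q using f
    g = copy.deepcopy(f)                        # initialize g with f
    m0 = metric(f, P_val)                       # baseline validation performance
    for _ in range(max_epochs):
        g = learn_step(g, P_train, Q_pseudo)
        if metric(g, P_val) < m0 - epsilon:     # early stop: g lost too much ID performance
            break
    return g
```

The early-stopping check mirrors the tolerance ϵ on the validation metric described above.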

B HYPOTHESIS TESTS

B.1 TWO SAMPLE TESTING METHODOLOGY

Our method to detect covariate shift, like prior work, is to perform a statistical hypothesis test between the distributions of one- or low-dimensional quantities derived from each element of a possibly shifted target dataset Q and a known in-distribution source dataset P⋆ that has not been observed during model development. A significant motivation for our work is that the majority of statistical hypothesis tests used by prior work are formulated in a fashion that is independent of the particular target dataset being tested (e.g., BBSD [Lipton et al., 2018], which uses a pre-trained classifier as an ansatz for dimensionality reduction). With the Detectron, in contrast, we follow a transductive approach, building a statistical test by training classifiers to meet a carefully crafted objective (i.e., constrained disagreement) on the target data. A drawback of this approach is that the low-dimensional representations are, in general, not iid; hence, to perform a fair statistical test, we must run the Detectron on a known in-distribution dataset under the same experimental conditions (e.g., sample size, learning rate, ensemble size). For other baseline methods that do not take the transductive approach (e.g., Mahalanobis, BBSD, Ensemble), we are not limited to choosing a source dataset P⋆ of the same size, as the samples can again be assumed to be iid. In practice, for iid methods, we fix the size of P⋆ to 1000 for CIFAR and Camelyon, and for UCI Heart Disease we use only 120 samples, as the dataset is significantly smaller (920 samples in total).

B.2 STATISTICAL TESTS USED IN METHODS/BASELINES

We provide a summary, in the context of our work, of the three types of statistical tests used in our experiments, along with technical details on how we use each test. Kolmogorov-Smirnov (KS) Test. The KS test is one of the most common non-parametric univariate statistical tests. In the two-sample setting, given datasets X = {x₁, ..., xₙ} and Y = {y₁, ..., yₘ}, the test statistic is computed as the maximum difference between the empirical CDFs of X and Y. An asymptotically correct p-value can be computed using a closed-form expression of the test statistic and the sample sizes n and m. An exact p-value can also be found by considering the fraction of all possible pairs of empirical CDFs that lie within the region with a maximum bounded difference; more details can be found in Hodges [1958]. In practice we use the KS test implementation found in scipy.stats.ks_2samp, which automatically computes exact p-values when max{n, m} ≤ 10,000 and otherwise defaults to the asymptotic approximation. We use KS tests for any distributions derived from continuous scores within our methods and baselines. For the Relative Mahalanobis Score test [Ren et al., 2021], we compute the p-value for shift via a KS test between the Mahalanobis scores for the possibly shifted target data Q and a source dataset of unseen in-distribution samples P⋆. In BBSD [Lipton et al., 2018] we compute a KS test on each dimension of the softmax output of a classifier between the source and target datasets; the final decision is made via Bonferroni correction, as done by Rabanser et al. [2019], which rejects if the minimum p-value falls below the significance level divided by the number of tests (i.e., the softmax dimension). Similarly, for Deep Ensemble (Entropy) and Detectron (Entropy), we perform a KS test directly on the distributions of entropy values computed from each sample in the source and target datasets, respectively. See Figure 6 for a full description of the Detectron entropy test.
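The BBSD-style procedure described above (one KS test per softmax dimension, Bonferroni-corrected) can be sketched with `scipy.stats.ks_2samp`; the function name and array layout are our own conventions, not from the paper's code.

```python
import numpy as np
from scipy.stats import ks_2samp

def bbsd_reject(softmax_src, softmax_tgt, alpha=0.05):
    """BBSD-style shift test: run one two-sample KS test per softmax
    dimension and apply a Bonferroni correction by comparing the smallest
    p-value to alpha / K, where K is the number of dimensions (tests)."""
    K = softmax_src.shape[1]
    pvals = [ks_2samp(softmax_src[:, k], softmax_tgt[:, k]).pvalue
             for k in range(K)]
    return min(pvals) < alpha / K
```

For instance, comparing softmax outputs drawn from a symmetric Dirichlet against a skewed Dirichlet with a few hundred samples each should trigger a rejection, while comparing a dataset against itself should not.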
Binomial Test. The binomial test is simple to state and has an elegant closed-form solution. We consider a binomially distributed random variable with rate q, X ∼ Binomial(n, q), for which we observe a single sample x. Since the binomial distribution is defined as a sum of iid Bernoulli random variables with the same rate, x may equivalently be interpreted as a set of n samples of which x are 1 and n − x are 0. Given a baseline rate p, we wish to determine the probability of observing an event at least as rare as X = x under the null hypothesis that p = q; this quantity can be computed exactly using the symmetry of the binomial distribution:

$$\Pr_{X\sim \mathrm{Bin}(n,p)}(X \text{ is rarer than } x) = 2\Pr_{X\sim \mathrm{Bin}(n,p)}(X \ge x) = 2\sum_{k=x}^{n}\binom{n}{k}p^{k}(1-p)^{n-k} = 2\,\frac{B_p(x,\,n-x+1)}{B(x,\,n-x+1)}$$

where $B_z(\alpha, \beta)$ is the incomplete beta function and $B(\alpha, \beta)$ is the beta function. Binomial testing is used in the Deep Ensemble (Disagreement) baseline method, where we estimate p as the disagreement rate of a deep ensemble on the set P⋆ (i.e., the number of samples in P⋆ where the ensemble does not predict unanimously, divided by the size of P⋆) and test for distribution shift based on the result of a binomial test on the observed disagreement on Q. Binomial testing is also used for the classifier two-sample test method (CTST) [Lopez-Paz and Oquab, 2017]: first a domain classifier is trained to separate source and target data, then its performance is tested on a set of unseen data, where the number of samples, out of a total of N, to which it correctly assigns a domain label is compared to the null distribution Bin(N, 0.5) (i.e., random guessing). For implementation purposes we use scipy.stats.binomtest. Permutation Test. Our ultimate goal is to detect covariate shift P ≠ Q at a bounded significance level (i.e., a bounded probability of outputting P ≠ Q when in fact P = Q). To bound the significance level, we follow the simple and principled approach of the permutation test.
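The incomplete-beta closed form above can be checked against the direct tail sum; a short sketch (helper names are ours), with `scipy.special.betainc` providing the regularized incomplete beta $B_p(x, n-x+1)/B(x, n-x+1)$:

```python
import math
from scipy.special import betainc

def tail_pvalue(x, n, p):
    """2 * P[X >= x] for X ~ Bin(n, p), computed via the regularized
    incomplete beta function, matching the closed form in the text."""
    return 2.0 * betainc(x, n - x + 1, p)

def tail_pvalue_sum(x, n, p):
    """The same quantity computed directly from the binomial pmf."""
    return 2.0 * sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                     for k in range(x, n + 1))
```

In practice, `scipy.stats.binomtest(x, n, p)` returns a two-sided p-value directly (its two-sided definition differs slightly from the symmetric form above).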
Suppose we wish to run the Detectron to test for shift on a set Q against a baseline P⋆ (each of N samples). Since the Detectron, or any other test, computes a p-value on low-dimensional quantities derived from Q and P⋆, the significance threshold of that test will not in general correspond precisely to the significance level for rejecting the original null hypothesis P = Q. The permutation test allows us to reclaim statistical guarantees by first performing several tests where the null hypothesis holds (e.g., we draw Q from P) and finding a cutoff for the significance of a p-value that sets the false positive rate at exactly 5%. Our experiments run the Detectron 100 times for each sample size on random sets Q drawn from P. Based on these runs we compute the 95th percentile τ of the final rejection count. We then run the actual test using a set of samples Q drawn from Q, which we deem significant at the 5% level if the number of rejected samples is greater than τ. A visual description of this method can be found in Figure 5.

Figure 5: We perform 100 of these calibration runs using different random seeds and samples for P⋆ to estimate a threshold τ such that 95% of the runs reject fewer than τ samples, thereby fixing the significance level of the test to 5%. To estimate the test power, we train CDCs using the exact same configuration as the calibration runs, except we replace P⋆ with a random set of N samples Q drawn from Q (CIFAR 10.1). By averaging the number of runs that reject more than τ samples we can compute the power (or true positive rate) of the test for the configuration.

Figure 6: The Detectron entropy test: Following the same experimental setup as Figure 5, we start (left) by computing a KS test between the continuous entropy values for each calibration run P⋆ and the flattened set of entropy values from all other 99 calibration runs. Then (center) we compute a KS test for each test run Q against a random set of all but one calibration runs. Finally (right), we find a threshold τ on the distribution of p-values obtained from step 1 as the α quantile, to guarantee a false positive rate of α. The power of the test is computed as the fraction of p-values computed from 100 test runs Q that fall below τ.
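The calibration logic can be sketched with hypothetical helpers (names are ours). Here a larger statistic, e.g. the rejection count, indicates shift, so τ is the (1 − α) quantile of the null statistics; for the entropy test, where small p-values indicate shift, the quantile direction flips.

```python
import numpy as np

def calibrate_threshold(null_stats, alpha=0.05):
    """Given test statistics from calibration runs where Q is drawn from P
    (the null), return the threshold tau that fixes the false positive rate
    at alpha, assuming larger statistics indicate shift."""
    return np.quantile(null_stats, 1.0 - alpha)

def estimate_power(test_stats, tau):
    """Fraction of test runs (Q drawn from the shifted Q) whose statistic
    exceeds tau -- an estimate of the true positive rate."""
    return float(np.mean(np.asarray(test_stats) > tau))
```

With 100 calibration statistics and 100 test statistics, this mirrors the procedure described above for fixing the significance level at 5% and estimating power.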

C PROOFS

The full proof of Theorem 1, referenced in section 3 (From Constrained Disagreement to Detecting Shift with Hypothesis Tests), is presented below. After which, we prove a related result that invokes a Bayesian perspective on distribution shift to generate a tight bound on the probability of shift given observations made by CDCs on unseen data.

Theorem 1 (Disagreement implies covariate shift). Let f be a classifier trained on dataset P consisting of samples drawn identically from P and their corresponding labels. Let g be a classifier that is observed to agree (classify identically) with f on P but disagree on a dataset Q drawn from Q. If the rate at which g disagrees with f on n unseen samples from Q is greater than that on n unseen samples from P with probability greater than

$$p^\star := \frac{1}{2}\left(1 - 4^{-n}\binom{2n}{n}\right),$$

there must be covariate shift.

Proof. Let $R_Q = \phi^Q_1 + \ldots + \phi^Q_n$ where each $\phi^Q_i$ is an iid Bernoulli random variable describing the probability of g disagreeing with f on an unseen sample from Q. Additionally, let $R_P = \phi^P_1 + \ldots + \phi^P_n$ be defined similarly for P. If P and Q are equal, then $\phi^P_i$ and $\phi^Q_i$ are equal in distribution, and the probability of observing $R_Q > R_P$ is tightly bounded by Equation 9. This is the tightest upper bound that is not a function of $E[\phi^P_i]$; the proof can be found in Lemma 1.

$$P = Q \implies \Pr(R_Q > R_P) \le \frac{1}{2}\left(1 - 4^{-n}\binom{2n}{n}\right) = \frac{1}{2} - O\!\left(\frac{1}{\sqrt{n}}\right) \qquad (9)$$

The more helpful contrapositive statement says that if it is sufficiently likely that $R_Q > R_P$, then the covariate distributions P and Q must not be equal:

$$\Pr(R_Q > R_P) > \frac{1}{2}\left(1 - 4^{-n}\binom{2n}{n}\right) \implies P \ne Q \qquad (10)$$

This result naturally lends itself to identifying P ≠ Q by rejecting an exact statistical hypothesis that $R_Q = R_P$ (in distribution) in favor of the alternative $R_Q > R_P$.

Lemma 1. Let X and Y be iid binomial random variables with distribution Bin(n, p). Then for all $n \in \mathbb{Z}_{\ge 0}$:

$$\Pr(X > Y) \le \frac{1}{2}\left(1 - 4^{-n}\binom{2n}{n}\right) < \frac{1}{2} \qquad (11)$$

Furthermore, Equation (11) is the tightest possible bound that does not depend on p. Proof.
Let $Z = X - Y$. While the distribution of Z is intractable to write down for arbitrary n, its characteristic function takes a convenient form:

$$\phi_Z(t; p) = E\!\left[e^{it(X-Y)}\right] = E\!\left[e^{itX}\right]E\!\left[e^{-itY}\right] \qquad (12)$$
$$= \left(1 + p(e^{it}-1)\right)^n\left(1 + p(e^{-it}-1)\right)^n \qquad (13)$$
$$= \left(p^2(2 - 2\cos t) + p(2\cos t - 2) + 1\right)^n \qquad \text{(expanding and simplifying, 14--16)}$$

Since X and Y are identically distributed, $\Pr(Z > 0) = \Pr(Z < 0)$, and so

$$\Pr(Z > 0) = \tfrac{1}{2}\left(1 - \Pr(Z = 0)\right) \qquad (17)$$

Equation (17) shows that a tight lower bound of the form $\Pr(Z = 0) \ge \alpha$ implies a tight upper bound of the form $\Pr(Z > 0) \le (1 - \alpha)/2$. To bound $\Pr(Z = 0)$ we first write it as an integral expression using the characteristic function inversion formula for discrete random variables [Ushakov, 2011]:

$$\Pr(Z = 0) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\phi_Z(t; p)\,dt \qquad (18)$$

Since $\phi_Z(t; p)$ has the form $(a(t)p^2 + b(t)p + 1)^n$ where a and b are real-valued functions and $a(t) \ge 0$ for all $t \in \mathbb{R}$ (i.e., an integer power of a quadratic with positive leading coefficient), for any choice of t, $\phi_Z(t; p)$ is globally minimized if and only if $p = p^\star = -b(t)/(2a(t))$. For this particular form of $\phi_Z(t; p)$, $p^\star$ is simply 1/2:

$$p^\star = -\frac{b(t)}{2a(t)} = -\frac{2\cos t - 2}{2(2 - 2\cos t)} = \frac{1}{2} \qquad (19)$$

This result is intuitive, as the variance of a binomial distribution Bin(n, p) is maximized for any fixed choice of n when p = 1/2. We should note that Equation (19) appears problematic when $\cos t = 1$, but in this case $\phi_Z(t; p)$ becomes constant, hence p cannot influence the bound. We can now write the lower bound for $\Pr(Z = 0)$:

$$\Pr(Z = 0) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\phi_Z(t; p)\,dt \;\ge\; \frac{1}{2\pi}\int_{-\pi}^{\pi}\phi_Z\!\left(t; \tfrac12\right)dt = \frac{1}{2\pi}\int_{-\pi}^{\pi}2^{-n}(\cos t + 1)^n\,dt = 4^{-n}\binom{2n}{n} \qquad (20\text{--}23)$$

The final expression in Equation (23) can be found using the change of variables $z = e^{it}$ and Cauchy's residue theorem [Needham, 2000]:

$$I = \frac{1}{2\pi}\int_{-\pi}^{\pi}2^{-n}(\cos t + 1)^n\,dt \qquad (24)$$
$$= -\frac{i}{2\pi}\oint_{|z|=1}2^{-n}\,\frac{1}{z}\left(1 + \frac{1+z^2}{2z}\right)^n dz \qquad \text{(using } t \to -i\log z\text{)} \quad (25)$$
$$= -\frac{i}{2\pi}\oint_{|z|=1}4^{-n}z^{-n-1}(z+1)^{2n}\,dz \qquad \text{(simplifying)} \quad (26)$$
$$= \mathrm{Res}\!\left(4^{-n}z^{-n-1}(z+1)^{2n},\, 0\right) \qquad \text{(applying Cauchy's theorem)} \quad (27)$$
$$= \frac{1}{4^n\, n!}\lim_{z\to 0}\frac{d^n}{dz^n}(z+1)^{2n} \qquad (28)$$
$$= \frac{1}{4^n\, n!}\,2n(2n-1)(2n-2)\cdots(n+1) \qquad (29)$$
$$= 4^{-n}\binom{2n}{n} \qquad (30)$$

Combining Equation (17) with the bound from Equation (23), we arrive at the claimed upper bound:

$$\Pr(X > Y) = \Pr(Z > 0) = \tfrac{1}{2}\left(1 - \Pr(Z = 0)\right) \le \tfrac{1}{2}\left(1 - 4^{-n}\binom{2n}{n}\right) \qquad (31\text{--}32)$$

Finally, we may use Stirling's approximation to show that $4^{-n}\binom{2n}{n} \in O(1/\sqrt{n})$, which converges to 0 as $n \to \infty$, leaving a limiting tight upper bound of 1/2.

Theorem 2 (Probability of Disagreement: A Bayesian Perspective). Let f be a classifier trained on dataset P drawn from the distribution P over X and their corresponding labels. Let g be a classifier observed to agree with f on P but disagree on a dataset Q drawn from a distribution Q over X. We denote the true probabilities that g will disagree with f on a sample from P and Q as p and q, respectively. Under a uniform prior U(0, 1) for p and q, if we observe that g disagrees with f on m out of M iid samples from Q while disagreeing on n out of N iid samples from P, then the posterior probability that g is truly more likely to disagree with f on Q than on P is:

$$\Pr[q > p] = 1 - \frac{(M+1)!\,(N+1)!\,(m+n+1)!}{(m+1)!\,n!\,(M-m)!\,(m+N+2)!}\;{}_3F_2(m+1,\, m-M,\, m+n+2;\; m+2,\; m+N+3;\; 1) \qquad (34)$$

where ${}_pF_q$ is the generalized hypergeometric function, implemented in several standard mathematical libraries. Proof. For simplicity, we consider the function $\mathrm{dis}: X \to \{0, 1\}$ that outputs 0 if $f(x) = g(x)$ and 1 otherwise.
We define the true disagreement rates p and q as

$$p := E_{x\sim P}[\mathrm{dis}(x)] \quad\text{and}\quad q := E_{x\sim Q}[\mathrm{dis}(x)] \qquad (35)$$

Without any a priori knowledge of dis, we place uniform priors on p and q (i.e., $p, q \overset{\text{iid}}{\sim} U(0, 1)$) to encode our belief over their true values. Now suppose we observe n disagreements out of N iid samples from P and m out of M iid samples from Q. By the definition in Equation (35), n and m are draws from the binomial distributions Bin(N, p) and Bin(M, q) respectively. We can then compute the posterior probability density functions of p and q conditioned on these observations using exact Bayesian inference:

$$f_{p|n}(x) := \Pr[p = x \mid n] \qquad (37)$$
$$= \Pr[n \mid p = x]\left(\int_0^1 \Pr[n \mid p = x]\,dx\right)^{-1} \qquad (38)$$
$$= \binom{N}{n}x^n(1-x)^{N-n}\left(\int_0^1\binom{N}{n}x^n(1-x)^{N-n}\,dx\right)^{-1} \qquad (39)$$
$$= (N+1)\binom{N}{n}x^n(1-x)^{N-n} \qquad (42)$$

The integration in Equation (39) is solved using the definition of the (incomplete) beta function: $\int_0^1 x^n(1-x)^{N-n}\,dx = B(n+1, N-n+1) = \frac{n!\,(N-n)!}{(N+1)!}$. Without loss of generality, we may also find $f_{q|m}$:

$$f_{q|m}(x) = (M+1)\binom{M}{m}x^m(1-x)^{M-m} \qquad (43)$$

Given these closed-form posterior distributions for p and q, we may compute the probability that the true value of q is greater than p.

Figure 7: Belief that the probability q that two classifiers disagree on samples from Q is greater than the probability p that they disagree on P, given an observation of m disagreements out of M samples from Q and n out of N on P. We plot this probability for M = 20 and N = 10000. We observe that even for a small test set size, M = 20, we can strongly believe that there is a true difference when m/M is only slightly larger than n/N.

This integral, while daunting, can easily be solved in closed form using the free online Wolfram Mathematica cloud (result link). The solution is exactly Equation (34).
$$\Pr[q > p \mid n, m] = \iint_{y > x} f_{q|m}(y)\,f_{p|n}(x)\,dy\,dx \qquad (44)$$
$$= \int_0^1\!\int_x^1 f_{q|m}(y)\,f_{p|n}(x)\,dy\,dx \qquad (45)$$
$$= (1+M)(1+N)\binom{M}{m}\binom{N}{n}\int_0^1(1-x)^{N-n}x^n\int_x^1(1-y)^{M-m}y^m\,dy\,dx \qquad (46)$$
$$= 1 - \frac{(M+1)!\,(N+1)!\,(m+n+1)!}{(m+1)!\,n!\,(M-m)!\,(m+N+2)!}\;{}_3F_2(m+1,\, m-M,\, m+n+2;\; m+2,\; m+N+3;\; 1) \qquad (47)$$

To gain intuition, a graphical representation of Equation (47) is provided in Figure 7. From a practical standpoint, if a practitioner trains two classifiers f and g that appear to disagree more often on a new dataset than a baseline rate computed on an in-domain test set, they can decide to act on that observation (e.g., collect more training data) at a particular belief threshold (e.g., probability greater than 80%). Furthermore, there is a natural link between Equation (47) and covariate shift: exact knowledge that q > p by definition implies not only covariate shift P ≠ Q, but a shift that is harmful by our original definition in section 3. Knowing the probability that q > p is therefore a useful measure of how likely it is, without any additional assumptions, that we are experiencing a harmful covariate shift.
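Both closed forms in this appendix lend themselves to numeric spot checks. The sketch below (helper names are ours) verifies the Lemma 1 bound $\frac{1}{2}(1 - 4^{-n}\binom{2n}{n})$ against the exact $\Pr(X > Y)$, with equality at p = 1/2, and computes $\Pr[q > p]$ directly from the Beta posteriors $f_{p|n} = \mathrm{Beta}(n+1, N-n+1)$ and $f_{q|m} = \mathrm{Beta}(m+1, M-m+1)$ by numerical integration, which should agree with the ${}_3F_2$ closed form in Equation (47).

```python
import math
from scipy import stats
from scipy.integrate import quad

def p_x_gt_y(n, p):
    """Exact P(X > Y) for X, Y iid Bin(n, p), by summing over the joint pmf."""
    pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    return sum(pmf[x] * sum(pmf[:x]) for x in range(1, n + 1))

def lemma1_bound(n):
    """The p-free upper bound 1/2 * (1 - 4^{-n} * C(2n, n)) from Lemma 1."""
    return 0.5 * (1.0 - math.comb(2 * n, n) / 4 ** n)

def posterior_q_gt_p(m, M, n, N):
    """P[q > p] under uniform priors, integrating f_{q|m}(y) * F_{p|n}(y)."""
    fp = stats.beta(n + 1, N - n + 1)   # posterior of p given n of N disagreements
    fq = stats.beta(m + 1, M - m + 1)   # posterior of q given m of M disagreements
    val, _ = quad(lambda y: fq.pdf(y) * fp.cdf(y), 0.0, 1.0)
    return val
```

As a sanity check, symmetric observations (same m/M and n/N counts) should give $\Pr[q > p] = 1/2$.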

D CONSTRAINED DISAGREEMENT LEARNING

We elaborate on the technical details of the objective functions required for training constrained disagreement classifiers.

D.1 DISAGREEMENT CROSS ENTROPY

Intuition. In our methodology, we propose the disagreement cross entropy (DCE) as a simple loss function that encourages a classifier to disagree with a target label while otherwise outputting a high-entropy prediction. In the binary classification case, the DCE is equivalent to simply flipping the target label and computing the regular cross entropy. Consider a model that outputs a distribution $\{p_1, p_2, 1-p_1-p_2\}$ over 3 classes, and suppose we would like to train it not to output high probability for class 3, i.e., to minimize $p_3 = 1-p_1-p_2$. An intuitive approach would be to maximize the standard cross-entropy loss $-\log(1-p_1-p_2)$ that one would equivalently minimize when trying to output high probability for class 3. We show in Figure 8 that this objective is unstable, whereas the DCE $-\frac{1}{2}(\log p_1 + \log p_2)$ is convex and takes values in a similar range to the regular cross entropy.

Loss Function. Let ŷ denote a discrete distribution over N classes and y a target class in {1, ..., N}. We define the disagreement cross entropy as the cross entropy of ŷ with the flipped target distribution v, with elements $v_i = 1/(N-1)$ for all indices $i \ne y$ and $v_y = 0$:

$$\mathrm{DCE}(\hat y, y) := \frac{1}{1-N}\sum_{i=1}^{N}\log(\hat y_i)\,\mathbb{1}_{i\neq y}$$

Let P denote the set of distributions over N classes with a unique maximum at position y (i.e., predictions that agree with the target y), and Q the set of those without. The vector v is a member of Q because it does not have a unique maximum at position y. Computing the DCE of v with respect to the target y gives:

$$\mathrm{DCE}(v, y) = \frac{1}{1-N}\sum_{i=1}^{N}\log(v_i)\,\mathbb{1}_{i\neq y} = \frac{1}{1-N}\,(N-1)\log\frac{1}{N-1} = \log(N-1)$$

Next, note that since $\hat y_y$ does not contribute to minimizing the DCE, which has the form $\frac{1}{1-N}\sum_{p\in \hat y\setminus \hat y_y}\log(p)$, a minimal solution must minimize the term $\hat y_y$. However, to enforce $p \in P$ we must have $p_y > p_i$ for all $i \in \{1,\ldots,N\}\setminus y$. Hence, minimizing $p_y$ while keeping $p \in P$ forces $p_y > 1/N$ and all other $p_i < 1/N$, since if $p_y \le 1/N$ we could choose some $p_i \ge 1/N$ and produce a vector $p \in Q$. Without loss of generality, choosing $p_y = 1/N + \varepsilon$ for some small $\varepsilon > 0$, we must choose all other $p_i = 1/N - \epsilon_i$ such that all $\epsilon_i > 0$ and $\sum_{i\in\{1,\ldots,N\}\setminus y}\epsilon_i = \varepsilon$.
This gives a DCE of

$$\mathrm{DCE}(p, y) = \frac{1}{1-N}\sum_{i=1}^{N}\log(p_i)\,\mathbb{1}_{i\neq y} = \frac{1}{1-N}\sum_{i\in\{1,\ldots,N\}\setminus y}\log\!\left(\frac{1}{N}-\epsilon_i\right)$$

where

$$\lim_{\varepsilon\to 0}\;\frac{1}{1-N}\sum_{i\in\{1,\ldots,N\}\setminus y}\log\!\left(\frac{1}{N}-\epsilon_i\right) = \log(N)$$

For vectors $p \in P$ we cannot achieve a DCE of less than $\log(N)$, but the minimum DCE in Q is $\log(N-1)$, which is less than $\log(N)$, proving that the DCE is a suitable scoring rule for disagreement. Implementation. We provide a simple PyTorch-style implementation for batched computation of the CDC objective in Equation 3 given a batch of logits and targets. We assume logits is a floating-point tensor of shape (batch size × N) of logit values, targets is an integer vector (batch size) of target classes, and mask is a 0-1 vector (batch size) that is 1 at index i if logits[i] comes from P and 0 otherwise.
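The listing itself did not survive in this version of the text; below is a NumPy sketch consistent with the description. The name `cdc_loss` and the λ weighting are our own labels, and the exact form of Equation 3 is assumed to be standard cross entropy on P samples plus λ-weighted DCE on Q samples; the paper's original is PyTorch.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def dce(logits, targets):
    """Disagreement cross entropy: 1/(1-N) * sum_{i != y} log p_i,
    i.e. cross entropy against the flipped target distribution."""
    N = logits.shape[1]
    logp = log_softmax(logits)
    mask = np.ones_like(logp)
    mask[np.arange(len(targets)), targets] = 0.0   # drop the target class term
    return (logp * mask).sum(axis=1) / (1.0 - N)

def cdc_loss(logits, targets, in_dist_mask, lam=1.0):
    """Batched CDC objective sketch: standard CE on samples from P
    (in_dist_mask == 1) and lam-weighted DCE on samples from Q."""
    logp = log_softmax(logits)
    ce = -logp[np.arange(len(targets)), targets]
    return np.where(in_dist_mask == 1, ce, lam * dce(logits, targets)).mean()
```

For N = 2, `dce` reduces exactly to cross entropy with the flipped label, as stated in the intuition above.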

D.2 EXTENSION TO DISCRETE MODELS

When training models with arbitrary discrete or non-differentiable parameters with respect to their objective (e.g., random forests), we must find a more general solution for creating CDCs. Such a solution should (1) reduce to the DCE when the model is, in fact, continuous and trained using the standard cross entropy, and (2) reduce to label flipping when N = 2 (binary classification). Our simple solution is to replicate every sample in Q exactly N − 1 times, creating a unique label for each from the set $S := \{1, \ldots, N\}\setminus\{t\}$, where t is the disagreement target, and giving each replica a weight of $1/(N-1)$. In the case N = 2, this corresponds to no replication and simply assigning the opposite label. In the case where the model learns by cross entropy, it equals Equation 3. Proof. We prove this statement starting with the definition of the cross entropy:

$$\mathrm{CE}(\hat y, y) = -\sum_{c=1}^{N}\mathbb{1}_{c=y}\log(\hat y_c) \qquad (55)$$

Now we consider the sum of the cross entropy over each label in S:

$$\sum_{y\in S}\mathrm{CE}(\hat y, y) = -\sum_{y\in S}\sum_{c=1}^{N}\mathbb{1}_{c=y}\log(\hat y_c) \qquad (56)$$
$$= -\sum_{c=1}^{N}\mathbb{1}_{c\neq t}\log(\hat y_c) \qquad (57)$$
$$= (N-1)\,\mathrm{DCE}(\hat y, t) \qquad (58)$$

Hence, giving each replica a weight of $(N-1)^{-1}$ recovers the exact form of the DCE.
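The replication recipe can be sketched as follows (helper name is ours); the returned weights are in the per-sample format accepted by, e.g., scikit-learn's and XGBoost's `sample_weight` arguments.

```python
import numpy as np

def replicate_for_disagreement(X, targets, num_classes):
    """Replicate each sample N-1 times, once per non-target label, each with
    weight 1/(N-1) -- the label-flipping generalisation of the DCE for
    discrete models such as random forests or gradient boosted trees."""
    Xs, ys, ws = [], [], []
    for x, t in zip(X, targets):
        for c in range(num_classes):
            if c != t:                  # every label except the disagreement target
                Xs.append(x)
                ys.append(c)
                ws.append(1.0 / (num_classes - 1))
    return np.array(Xs), np.array(ys), np.array(ws)
```

For N = 2 each sample yields a single replica carrying the opposite label with weight 1, i.e. plain label flipping.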

D.3 DISAGREEMENT WITH OVERPARAMETERIZED MODELS

Many existing tests for covariate shift that rely on overparameterized models do not perform well in the small-sample regime due to catastrophic overfitting. In this section, we explain how this phenomenon, if ignored, can also be catastrophic for the Detectron, but is easily fixable with a simple early stopping technique. We recall the main hypothesis on which the Detectron is built: given a base classifier f ∈ F that is well fit to a training dataset P, it is easier to learn another classifier g ∈ F that disagrees with f on unlabeled data Q, but agrees with f on P, when the distribution of Q is far from that of P. Restated with the ideas introduced in this work, the CDC objective is more easily satisfied when there is a harmful shift. This hypothesis is exemplified in Figure 9, where we record the associated CDC disagreement rate on both unseen in-distribution data P⋆ (CIFAR 10) and a known covariate shift Q (CIFAR 10.1): the CDC disagreement rate grows significantly more rapidly when Q is chosen to be out-of-distribution (blue curve) versus in-distribution (black curve). However, as the models are overparameterized, the disagreement rate eventually reaches 1 independently of whether we use the true OOD data Q or the ID data P⋆. When we trained such an overparameterized model for too long, the disagreement rate approached 1 on both in- and out-of-distribution datasets, leading to a maximally uninformative test; it is therefore crucial to perform early stopping at or before the out-of-distribution disagreement rate reaches 1. Since our test is computed based on the disagreement rate $\phi_Q$ falling above the $1-\alpha$ quantile of the calibration distribution of $\phi_{P^\star}$, the test will be highly uninformative if $\phi_Q = \phi_{P^\star} = 1$.
Note that it is critical to use the exact same training algorithm on both the given test set Q and the in-distribution calibration set P⋆ to ensure the statistical soundness of the Detectron: one may tune the CDC learning algorithm on either Q or P⋆ to achieve a desired behavior, but must then apply the exact same algorithm to the other set before testing for statistical differences. In our experiments with overparameterized models, we picked a relatively low number of training iterations (see subsection E.2 for details), practically eliminating the issue of overfitting. In general, however, one should fix a training budget for CDCs such that they remain far from reaching an ID disagreement rate of $\phi_P = 1$. The exact level of $\phi_P$ at which to stop will depend on the dataset and learning algorithm used, but as a general guideline we suggest ≈ 0.5, allowing a large gap between ID and OOD disagreement rates while giving enough compute budget to achieve non-trivial results.
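One way to realize the ≈ 0.5 guideline while keeping the algorithm identical on both sets is to fix the budget from the calibration runs alone. The helper below is our own framing, not the paper's code: it picks the number of training batches after which the calibration-run ID disagreement rate first reaches the target, and that fixed budget is then reused verbatim when training CDCs on the actual test set Q.

```python
def pick_training_budget(phi_history, target=0.5):
    """Choose a fixed CDC training budget from calibration runs: the index of
    the first training step at which the ID disagreement rate phi_P* reaches
    `target` (guideline ~0.5). Reusing this fixed budget on Q keeps the CDC
    learning algorithm identical across calibration and test runs."""
    for step, phi in enumerate(phi_history):
        if phi >= target:
            return step
    return len(phi_history)   # rate never saturated: use the full budget
```

This keeps $\phi_{P^\star}$ well below 1 so that the calibration distribution remains informative.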

E EXPERIMENTAL DETAILS

E.1 BASE CLASSIFIERS

For each dataset used in our experiments, we begin by training a base classifier on the source domain portion of the dataset to use in subsequent experiments and baselines. For a brief description of the datasets used and base classifiers, see Table 4 and for a more detailed description of each dataset as well as what we have considered precisely as the source and shifted domains, see subsection F.2. CIFAR 10. We use a standard Resnet18 model pre-trained on ImageNet [Deng et al., 2009] made available in the torchvision library [Marcel and Rodriguez, 2010] although we reinitialize the last network layer to have an output size of 10. We use stochastic gradient descent (SGD) with a base learning rate of 0.1, L 2 regularization of 5 × 10 -4 , momentum of 0.9, a batch size of 128 and a cosine annealing learning rate schedule with a maximum 200 iterations stepped once per epoch for a total of 200 epochs. We use the standard CIFAR-10 training split normalized by its mean (µ = [0.4914, 0.4822, 0.4465] ) and standard deviation (σ = [0.2023, 0.1994, 0.2010] ). Every epoch, we randomly crop each image to a size of 32 × 32 after applying a 0 padding of four pixels to each spatial dimension, and we apply a horizontal flip with probability 0.5. This model achieves a test performance of 87%. While this score is far from state-of-the-art on CIFAR-10, our goal is not to construct a perfect model. We wish to create a reasonably good model as an example of a model that could realistically be deployed in real-world settings. When training deep ensembles, we only vary the random seed in the range [0, . . . , 4]. Camelyon 17. We follow a similar approach to CIFAR 10. However, we use two output features (for binary classification of cancerous or benign pathology), a batch size of 512, the ADAM optimizer [Kingma and Ba, 2015] with a base learning rate of 0.001, L 2 regularization of 10 -5 and a total of 5 training epochs for which we select the model with the best validation accuracy. 
This model achieves a test accuracy of 0.93. When training deep ensembles, we only vary the random seed in the range [0, . . . , 4].

UCI Heart Disease. We train both neural networks and gradient boosted trees using the XGBoost library [Chen and Guestrin, 2016]. For the neural network model, we use a simple MLP with an input dimension of 9, 3 hidden layers of size 16 with ReLU activation followed by a 30% dropout layer, and a linear layer to 2 outputs (heart disease present or not). We use 358 samples for training and 120 for validation. We train for a maximum of 1000 epochs and select the model with the highest AUC on the validation set, performing early stopping if the validation AUC has not increased in over 100 epochs. This model achieves a test AUC of 0.85, computed on 119 samples. As with CIFAR 10 and Camelyon 17, we only vary the random seed, here in the range [0, . . . , 9], when training deep ensembles. Note that we chose a larger ensemble size here as models are fairly cheap to train. Another important trick when using small Q sizes is to sample all of Q in each batch, filling the rest of the batch with a random set of samples from P. This procedure artificially inflates the size of Q, so the hyperparameter λ must account for this by picking up an extra multiplicative factor equal to (batches per epoch)⁻¹. When training gradient boosted trees using XGBoost, we employ standard library parameters (η = 0.1, eval_metric = auc, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8, min_child_weight = 1, objective = binary:logistic, num_round = 10). This model, while taking less than 5 s to train, achieves a test AUC of 0.88.
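The batch-filling trick for small Q can be sketched as follows. This is a minimal illustration rather than the released training code; `build_batches` and `adjusted_lambda` are our own hypothetical helper names.

```python
import random

def build_batches(P, Q, batch_size, batches_per_epoch, seed=0):
    """Include ALL of Q in every batch, topping the batch up with random
    samples from P (this artificially inflates the effective size of Q)."""
    rng = random.Random(seed)
    assert len(Q) < batch_size
    return [list(Q) + rng.sample(P, batch_size - len(Q))
            for _ in range(batches_per_epoch)]

def adjusted_lambda(q_size, batches_per_epoch):
    """Base weight 1/(|Q| + 1), scaled by (batches per epoch)^-1 to
    compensate for Q appearing in every batch instead of once per epoch."""
    return 1.0 / (q_size + 1) / batches_per_epoch
```

With a batch size of 512 and ten target samples, every batch contains all of Q plus 502 random source samples, and λ shrinks by one factor of the number of batches per epoch.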

E.2 CONSTRAINED DISAGREEMENT CLASSIFIERS

We expand on the experimental details for learning constrained disagreement classifiers. When training a CDC g_(f,P,Q), we start by creating a new dataset that combines all elements of the labeled set P and the unlabeled set Q with pseudo-labels inferred by the base classifier f. We store a single bit for each sample in the combined dataset to indicate whether it was originally drawn from P or Q. When training CDCs with neural networks, we use the DCE loss (Equation 3) under similar semantics as the pseudo-code implementation provided above. When training discrete models, we resort to our generalized approach in subsection D.2. To reduce training time, we initialize g using the exact same architecture/weights as f and apply the exact same optimization algorithm/learning rate used to train f (see subsection E.1). For CIFAR 10, we train each CDC for a maximum of 10 epochs, performing early stopping if the model's in-distribution validation performance drops by over 5%. We enforce the early stopping criterion to help prevent CDCs from overfitting to the disagreement loss when the target dataset has not come from a harmfully shifted domain. The intuition is the following: under the null, if a target dataset Q comes from the same distribution as a training dataset P, then learning to disagree with f on Q while constrained to agree on all of P can only be solved by overfitting to predict with high entropy on the specific examples in Q, versus learning a distinct pattern that distinguishes the distributions. Forcing a model to predict with high entropy on a subset of in-distribution datapoints can only hurt its associated in-distribution generalization, a phenomenon which we can directly assess by measuring validation performance.
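The dataset construction described above can be sketched as follows (a minimal illustration; the function name is ours, not from the released code):

```python
def build_cdc_dataset(P_labeled, Q_unlabeled, base_predict):
    """Combine the labeled set P with the unlabeled set Q pseudo-labeled by
    the base classifier f. The third field is the single bit marking each
    sample's origin (0 -> P, 1 -> Q), which the loss uses to decide whether
    to encourage agreement (cross-entropy) or disagreement (DCE)."""
    combined = [(x, y, 0) for x, y in P_labeled]
    combined += [(x, base_predict(x), 1) for x in Q_unlabeled]
    return combined
```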
The details for training CDCs on Camelyon 17 are the same as those described for CIFAR 10; however, due to the large training set size (302,436 samples), we simply select a random subset of size 50,000 as P at each epoch, a number we experimentally deemed sufficient to achieve low in-distribution generalization error. When training CDCs on the UCI Heart Disease dataset, we use XGBoost [Chen and Guestrin, 2016] with the same hyperparameters described in subsection E.1. For the runtime experiment presented in Figure 4, we train each CDC for only one batch, where each batch contains a set of 100 samples Q and is filled up to a batch size of 512 with random samples from P_train. After every batch, we eliminate all samples on which the CDC disagrees with the base predictions. We continue this for a maximum of 150 batches, but perform early stopping if 10 batches pass without the CDC disagreeing on at least one sample.
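The per-batch filtering with early stopping can be sketched as follows (hypothetical names; `cdc_step` stands in for one batch of CDC training followed by prediction on the remaining samples):

```python
def filter_disagreed(Q, base_predict, cdc_step, max_batches=150, patience=10):
    """After every training batch, remove all samples on which the CDC
    disagrees with the base predictions; stop early if `patience` batches
    pass without at least one new disagreement."""
    remaining = list(Q)
    batches_without_progress = 0
    for step in range(max_batches):
        preds = cdc_step(step, remaining)  # train one batch, then predict
        kept = [x for x, p in zip(remaining, preds) if p == base_predict(x)]
        if len(kept) < len(remaining):
            batches_without_progress = 0
        else:
            batches_without_progress += 1
        remaining = kept
        if batches_without_progress >= patience:
            break
    return remaining
```

The number of eliminated samples, |Q| − |remaining|, is the raw disagreement count the test statistic is built from.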

E.3 DESCRIPTION OF BASELINE METHODS

We compare the Detectron against several methods for OOD detection, uncertainty estimation, and covariate shift detection found in recent literature.

1. Deep Ensembles, shown by Ovadia et al. [2019] to provide the most accurate estimates of predictive confidence under covariate shift. To compare directly with the Detectron, we test both the disagreement rates and the entropy distributions of the ensemble. See Appendix B for more information on how these tests are run.

2. Black Box Shift Detection (BBSD) [Lipton et al., 2018], the overall best method across the numerous synthetic benchmarks for covariate shift detection evaluated by Rabanser et al. [2019]. We follow the same evaluation and perform a univariate KS test on each dimension of the softmax output of the base classifier between Q and a held-out set from the training distribution. Bonferroni correction is used to compute a single p-value as the minimum p-value multiplied by the number of tests. We guarantee significance using the same permutation approach described in Appendix B.

3. Relative Mahalanobis Distance (RMD) [Ren et al., 2021], a method designed specifically for identifying near-OOD samples using the penultimate layer of a pretrained model. We test for covariate shift by performing a KS test directly on the distributions of RMD confidence scores derived on Q and P⋆.

4. Classifier two sample test (CTST) [Lopez-Paz and Oquab, 2017]. Using the same architecture and initialization as the base classifier, we reconfigure the output layer and train a domain classifier on half the test data, with source data labeled as 0 and test data as 1. We then test this model's accuracy on the other half of the test data and compare its performance to random chance using a binomial test (see Appendix B for more details). While this method is technically sound, it is not suitable for the low data regime, where a domain classifier learned on half the test data is unlikely to generalize beyond random performance on the other half.

5.
Deep Kernel MMD [Liu et al., 2020]. We use the authors' original source code, available at https://github.com/fengliu90/DK-for-TST, to perform the deep kernel MMD test.

6. H-Divergence [Zhao et al., 2022]. Most similar to our approach, this work proposes a two sample test based on the output of a learning model after training on either source or target data. Specifically, the authors fit a model to the source dataset P, the target dataset Q, and a uniform mixture (P + Q)/2. Under the null hypothesis P = Q, the loss in each case is equal in expectation; however, when P ≠ Q, the generalized entropy of the mixture distribution may be larger. In practice, the authors fit three VAE [Kingma and Welling, 2014] models and compute the test statistic ℓ((P + Q)/2) − min(ℓ(Q), ℓ(P)), where ℓ is the VAE loss computed as a sum of the binary cross entropy reconstruction loss and the KL divergence regularizer. They perform 100 runs where the null hypothesis holds (i.e., sampling Q from P) and runs where it does not. Significance is determined in the standard way, by observing whether the true test statistic exceeds the 95th percentile of the test statistic distribution under the null hypothesis. Unfortunately this method, while state of the art on several benchmarks including the MNIST vs Fake MNIST two sample test, demonstrated low utility on more complex tasks with smaller sample sizes. After a discussion with the authors, we attempted to improve the results by first pretraining the VAE to produce valid samples and reconstructions under the source distribution and computing the H-Divergence statistic after finetuning. Despite this effort, we still saw low statistical significance with small sample sizes, likely due to the noisy nature of training VAEs in the low data regime. We use the authors' original source code, available at https://github.com/a7b23/H-Divergence.
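For concreteness, the per-dimension KS test with Bonferroni correction (baseline 2) and the H-Divergence test statistic (baseline 6) can be sketched in pure Python. The names are ours, permutation p-values stand in for the asymptotic KS p-value, and a generic `loss_fn` stands in for the full VAE training and evaluation:

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical CDFs."""
    a_s, b_s = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a_s, x) / len(a_s)
            - bisect.bisect_right(b_s, x) / len(b_s))
        for x in a_s + b_s
    )

def bbsd_pvalue(soft_P, soft_Q, n_perm=200, seed=0):
    """BBSD sketch: a permutation KS test on each softmax dimension,
    combined with a Bonferroni correction (smallest p-value multiplied
    by the number of tests)."""
    rng = random.Random(seed)
    n_dims, p_vals = len(soft_P[0]), []
    for d in range(n_dims):
        a = [row[d] for row in soft_P]
        b = [row[d] for row in soft_Q]
        observed = ks_statistic(a, b)
        pooled, hits = a + b, 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            hits += ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed
        p_vals.append((hits + 1) / (n_perm + 1))
    return min(1.0, min(p_vals) * n_dims)

def h_divergence_statistic(loss_fn, P, Q):
    """H-Divergence sketch: loss of a model fit on the mixture minus the
    smaller of the losses of models fit on P and Q alone; equal to zero
    in expectation under the null P = Q."""
    return loss_fn(P + Q) - min(loss_fn(P), loss_fn(Q))
```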

E.4 COMPARISON TO GENERALIZATION ERROR PREDICTION

Related to distribution shift detection is the problem of estimating out-of-distribution generalization error on unlabeled data. Any method that estimates generalization error can be thought of as a regression from a dataset to a real-valued test statistic ψ, which represents the estimated generalization error. As such, we can calibrate the distribution of ψ based on unseen i.i.d. data in P⋆ and test for distribution shift by determining if the predicted generalization error on Q is within the 5% extreme of the calibration distribution. Many recent methods [Platanios et al., 2016; Yu et al., 2022; Garg et al., 2022] have been proposed to address this problem. While Garg et al. propose the simplest approach, their method is conceptually identical to BBSD when applied to the task of shift detection. Hence, we compare only with the Projection Norm approach of Yu et al., as it presents the current state of the art and provides a well-documented experimental repository. A comparison on the CIFAR10/10.1 benchmark is found in Table 2 and a further analysis of the scaling is given in Table 3. Ultimately, the Projection Norm presents another useful approach for identifying covariate shift; however, its computational complexity is at least equal to, and in most cases greater than, that of the Detectron, and its performance is in all cases lower.

Table 2: Comparison of Detectron (Entropy) and the Projection Norm [Yu et al., 2022]. We report the TPR@5 for both methods using sample sizes |Q| = 10, 20 and 50.

Camelyon 17 [Veeling et al., 2018]. Using the standard set by the WILDS framework [Koh et al., 2021], we use hospitals 1-3 as the source domain for training and validating models, and hospital 5 as the target domain for assessing distribution shift.
When detecting distribution shifts, we are primarily concerned with those that will harm the performance of a predictive model in deployment; that is, not simply any change in the training distribution, but shifts that assign high probability to samples outside of the generalization region. The Detectron aims to use not just a model and dataset to detect shifts, but also the learning algorithm, tying the detection of shifts directly to R. While this argument is informal, we present an experiment to show that a non-trivial generalization region does exist in a simple machine learning task. Figure 13 shows that a LeNet-5 model [LeCun et al., 1989] trained on rotated MNIST images can match in-distribution performance when tested on various rotations outside of its training set. We train this model for ten epochs using the ADAM optimizer at a base learning rate of 0.001. By leveraging a model's existing learning algorithm, the Detectron can detect whether new, unlabeled datasets are likely to belong to R by training a new model that is constrained to learn the same behavior over R while being encouraged to predict randomly otherwise. While this experiment is simplistic, it shows that learning algorithms can generalize to sets outside their training distribution.

H HARMFUL COVARIATE SHIFT AND CONNECTION TO THE A-DISTANCE

Given a labeled training set P as well as another labeled dataset Q, one can identify using standard statistical estimation if a model f performs more poorly on Q compared to P. However, in a practical scenario, decision models are deployed on unlabeled datasets; hence directly computing model performance is impossible. To decide, then, if Q has been drawn from a distribution that may cause f to fail, we formulate an adversarial-learning-style definition of harmful covariate shift that does not require access to labeled examples.

Definition: (ℓ, α, F)-Harmful Covariate Shift.
A covariate shift from distributions P → Q over X is (ℓ, α, F)-harmful with respect to a set of decision models F if there exists a subset f ⊆ F of two or more models that each achieve a source domain loss ℓ(f, P) ≤ α while being more likely to disagree with each other on an unseen sample from Q than on one from P:

∃ f ⊆ F s.t. ∀ f ∈ f : ℓ(f, P) ≤ α and (59)
P_x∼Q(∃ f_i, f_j ∈ f s.t. f_i(x) ≠ f_j(x)) > P_x∼P(∃ f_i, f_j ∈ f s.t. f_i(x) ≠ f_j(x))

In plainer words, we define harmful covariate shift based on the existence of multiple good models on P that tend to disagree on Q. The Detectron algorithm is designed to learn these models (constrained disagreement classifiers) and statistically test their disagreement rates.

We can connect our definition of harmfulness to the well-studied concept of the A-distance from Kifer et al. [2004]. The A-distance is a generalization of the total variation to an arbitrary collection of measurable events A:

d_A(P, Q) = 2 sup_{A∈A} |Pr_P[A] − Pr_Q[A]|

Ben-David et al. [2006] show that, choosing the class of events whose characteristic functions are functions in F, the A-distance in connection with VC theory [Vapnik, 1995] allows for finite-sample generalization bounds on the performance of arbitrary decision models from F under covariate shift. Ben-David et al. [2006] go on to show that the A-distance defined for a binary function class F is equal to

d_F(P, Q) = 2 (1 − 2 min_{f∈F} err(f)),

where min_{f∈F} err(f) is the minimum error that a domain classifier from F can achieve on the task of distinguishing samples from P and Q (i.e., if P = Q the best domain classifier will have an error of 0.5 and d_F(P, Q) = 0, and if P and Q can be perfectly discriminated by some f ∈ F then d_F(P, Q) is maximized and equal to 2).
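Both quantities in this section are straightforward to compute from finite samples; a minimal sketch (our names, illustrative only):

```python
def a_distance(domain_classifier_error):
    """d_F(P, Q) = 2 * (1 - 2 * min_f err(f)): 0 when the best domain
    classifier is at chance (err = 0.5), 2 when P and Q are perfectly
    separable (err = 0)."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)

def ensemble_disagreement_rate(models, xs):
    """Empirical probability that at least one pair of models in the
    ensemble disagrees on a sample, as in the harmfulness definition."""
    return sum(len({m(x) for m in models}) > 1 for x in xs) / len(xs)
```

Harmfulness is then witnessed by an ensemble of low-loss models whose `ensemble_disagreement_rate` is higher on samples from Q than on samples from P.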
In our characterization of harmful covariate shift, we consider not just the discriminative power of F but the broader generalization region (Appendix G) induced by training f to achieve a certain source domain loss on P. For instance, if a model naturally learns rotational invariance, as shown in Figure 13, one would also want a shift detector that does not flag shifts consisting only of rotations. Beyond the concept of harmfulness, we empirically show that learning to detect shifts using CDCs instead of domain classifiers improves shift detection performance.

I ON MODEL COMPLEXITY AND THE DETECTRON

In our methodology, we aim to specifically detect covariate shifts that lead to unpredictable behavior of models over a given function class F. To better understand how the complexity of F changes the behavior of the Detectron, we present a simple example in Figure 14. The Detectron is designed to work in the central and right cases of Figure 14 (i.e., where the function class F contains the true underlying mechanism that labels our data). However, the types of shifts considered harmful will change with the complexity of F. In the ideal case, where the model class complexity matches that of the true mechanism, the model family does not allow significant variation in the decisions for points outside of, but close to, the training distribution, meaning the Detectron will not, by design, detect nearby shifts. When models are overly complex, more types of shifts should naturally be considered harmful, as models from F can exhibit more variation outside of the training domain, meaning we can never guarantee that we have chosen the correct model. In the last case, where the model family is too simple to contain the true mechanism, the Detectron loses its power to identify shifts that may result in performance penalties, as would any comparable method from OOD/uncertainty estimation that relies on a sufficiently well-calibrated classifier. The inability of our approach to handle this case is an unavoidable limitation of our methodology. However, models that are simpler than their underlying mechanism will often exhibit poor held-out performance even on in-distribution tasks, drastically limiting the likelihood of deployment.



Code available at https://github.com/rgklab/detectron
See Appendix G for further intuition and a concrete example of the generalization set.
A discussion of the relation between the model complexity of F and the behaviour of CDCs, and the associated limitations of our definition, is given in Appendix I.



Figure 1 (diagram text): Finetune f → g_Q to disagree on observed data {x_1, . . . , x_m} ∼iid Q_X. Finetune f → g_P to disagree on i.i.d. data {x_1, . . . , x_m} ∼iid P_X. Compare the disagreement rates ϕ_P = P_X[g_P(x) ≠ f(x)] and ϕ_Q = Q_X[g_Q(x) ≠ f(x)] as a test for covariate shift.

Figure 1: Overview of the Detectron: Starting with a base classifier f trained on labeled samples from a distribution P, we train new Constrained Disagreement Classifiers (CDCs) g_P and g_Q on small sets of unseen samples from P as well as an unknown distribution Q. CDCs aim to maximize classification disagreement on unseen data while constrained to classify consistently with f on their original training set. The rate ϕ at which CDCs disagree is a powerful and sample-efficient statistic for identifying covariate shift P ≠ Q.

The Detectron algorithm for detecting harmful covariate shift
Input: P: labeled dataset, Q: unlabeled dataset, L: learning algorithm, K: calibration rounds = 100, ℵ: ensemble size = 5, α: significance level = 0.05
Output: test result for harmful covariate shift at significance level α
P_train, P_val, P⋆ ← Partition(P)
N ← |Q|; Φ_P ← [ ]
f ← L(P_train, P_val) // Load or train a base classifier on P
repeat
    p⋆ ← RandomSample(P⋆, N)
    // Train an ensemble of CDCs on P⋆
    while n > 0 and iterations ≤ ℵ do
        g ← ConstrainedDisagreement(L, P_train, P_val, p⋆, f) // See Appendix TODO
        p⋆ ← {x | x ∈ p⋆ and f(x) = g(x)} // Filter out disagreed-on data
        ϕ_P ← 1 − |p⋆|/N // Update disagreement rate
    end
    Append ϕ_P to Φ_P
until K iterations elapse
// Train an ensemble of CDCs on Q
while
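The truncated pseudocode above can be filled in with a minimal Python sketch. All names are ours, and `train_cdc` stands in for the full constrained disagreement training procedure:

```python
import random

def detectron_test(train_cdc, base_predict, P_star, Q,
                   K=100, ensemble_size=5, alpha=0.05, seed=0):
    """Calibrate the disagreement rate phi on K random subsets of held-out
    in-distribution data, then flag harmful shift if the rate measured on Q
    exceeds the (1 - alpha) quantile of the calibration distribution."""
    rng = random.Random(seed)

    def disagreement_rate(samples):
        remaining, n = list(samples), len(samples)
        for _ in range(ensemble_size):
            if not remaining:
                break
            g = train_cdc(remaining)  # one CDC trained to disagree on `samples`
            remaining = [x for x in remaining if g(x) == base_predict(x)]
        return 1.0 - len(remaining) / n

    null = sorted(disagreement_rate(rng.sample(P_star, len(Q)))
                  for _ in range(K))
    tau = null[int((1 - alpha) * K)]
    return disagreement_rate(Q) > tau  # True -> harmful covariate shift
```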

Figure 2: Ensemble Size vs Properties of Constrained disagreement classifiers on CIFAR-10/10.1: (Left)

(5) Classifier Two Sample Test (CTST) [Lopez-Paz and Oquab, 2017]. (6) Deep Kernel MMD (MMD-D) [Liu et al., 2020].

Figure 3: True positive rate at the 5% significance level for the Detectron and baseline methods for detection of covariate shift on the UCI heart disease dataset. The Detectron (Entropy) is shown to uniformly outperform baselines. Confidence intervals are excluded for visual clarity but are found in Table 1.

Figure 4: Runtime Characteristics: We train 100 random runs of CDCs on 100 samples from CIFAR 10 and 10.1 and compute the disagreement statistic as the difference ψ := E[ϕ_Q − ϕ_P]. While ψ peaks near 50 training batches, only 10 batches are required for the Detectron disagreement test to reach an area under the TPR vs FPR curve (AUROC) of nearly 1 (i.e., perfect discrimination). Training CDCs for too long eventually lowers ψ, as E[ϕ_Q] ≈ E[ϕ_P] ≈ 1, meaning CDCs eventually overfit to disagreeing on all of their data.

Generalization and Model Complexity: The definition of the Detectron specifies the use of the same function class for identifying shift as is used in the original prediction problem. The Detectron may fail to detect harmful shifts in cases where the base model is learned from an underspecified function class. Appendix I provides additional context, examples and ways to mitigate this limitation. The precise relationships between model complexity, generalization error, and test power are interesting directions for future work.

Compute the validation performance of f
while M(f, P_val) > m_0 − ε and iterations < k do
    // Training epoch over PQ
    for batch ∈ PQ do
        x_P, y_P ← {(x, y) | (x, y) ∈ batch and (x, y) ∈ P_train}
        x_Q, y_Q ← {(x, y) | (x, y) ∈ batch and (x, y) ∈ Q}
        // Update g using an existing learning algorithm for (x_P, y_P) and the appropriate disagreement update for (x_Q, y_Q)

Figure 5: The Detectron disagreement test: In this example (taken from our experiment where P = CIFAR 10, Q = CIFAR 10.1 and sample size N = 50), we start by training an ensemble of CDCs (we use an ensemble size of 5) to reject/disagree on a set of N unseen samples from the original training distribution (P⋆) while constrained to perform consistently with a base model on the original training and validation sets used to train the base model on CIFAR 10. We perform 100 of these calibration runs using different random seeds and samples for P⋆ to estimate a threshold τ such that 95% of the runs reject fewer than τ samples, thereby fixing the significance level of the test to 5%. To estimate the test power, we train CDCs using the exact same configuration as the calibration runs, except we replace P⋆ with a random set of N samples Q from Q (CIFAR 10.1). By averaging the number of runs that reject more than τ samples, we can compute the power (or true positive rate) of the test for this configuration.
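The threshold selection described in this procedure amounts to an empirical quantile; a minimal sketch with our own names:

```python
def calibrate_threshold(null_reject_counts, alpha=0.05):
    """Pick tau so that a fraction (1 - alpha) of the calibration runs
    reject fewer than tau samples (the empirical (1 - alpha) quantile)."""
    s = sorted(null_reject_counts)
    return s[min(int((1 - alpha) * len(s)), len(s) - 1)]

def estimated_power(shift_reject_counts, tau):
    """Fraction of runs on shifted data that reject more than tau samples."""
    return sum(c > tau for c in shift_reject_counts) / len(shift_reject_counts)
```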

and M as x i.i.d∼ Q M and compute the number of times dis(x) equals 1 on each set.

Figure 9: CDC Training Dynamics for Overparameterized Models: Using the same experimental setup as the runtime study in Figure 4, we record the associated CDC disagreement rate on both unseen in-distribution data P⋆ (CIFAR 10) and a known covariate shift Q (CIFAR 10.1). When such an overparameterized model is trained for too long, the disagreement rate approaches 1 on both in- and out-of-distribution datasets, leading to a maximally uninformative test; it is, therefore, crucial to perform early stopping at or before the out-of-distribution disagreement rate reaches 1.

Figure 10: CIFAR 10 vs CIFAR 10.1. (Image borrowed from the technical report "Do CIFAR-10 Classifiers Generalize to CIFAR-10?" [Recht et al., 2018])

Figure 13: We train a CNN with the LeNet-5 architecture [LeCun et al., 1989] on a dataset consisting of MNIST images rotated from 0 to 80 degrees counterclockwise in steps of 20. We test the model using rotations of -25 to 125 degrees in steps of 5. We observe that the model achieves nearly identical test accuracy on all test angles between -5 and 85, indicating that it has generalized to angles outside of its training set.

Results (true positive rate at the 5% significance level) for detection of harmful covariate shift on CIFAR-10.1, Camelyon 17 and UCI Heart Disease benchmarks. The best result for each column is bolded, results that are within 2% of the best are underlined and the best baseline method is italicized.

Algorithm 3: Constrained Disagreement
Input: L: learning algorithm, P_train: labeled training dataset {. . . , (x_i, y_i), . . . }, P_val: labeled validation dataset {. . . , (x_i, y_i), . . . }, Q: unlabeled test dataset {. . . , x_i, . . . }, f: classifier trained on P, M: evaluation metric (default accuracy), ε: tolerance (default 0.05), k: max epochs (default 10)
Output: Constrained Disagreement Classifier g_(f,P,Q)

Projection Norm results [Yu et al., 2022] on larger sample sizes of CIFAR10/10.1. We observe that ≈ 500 samples are required to reach near-perfect test power.

Datasets: We investigate three different forms of covariate shift. To verify that these shifts are indeed harmful to the models, we report performance in both the shifted and unshifted domains. Examples and further descriptions of unshifted/shifted splits of each dataset are given in Appendix F.


ACKNOWLEDGMENTS

Tom Ginsberg's research was supported by a New Frontiers in Research Fund NFRFR-2022-00526, an LG research grant, and a Canada Graduate Scholarship (CGS-M). Rahul G. Krishnan was supported by a CIFAR AI Chair. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Additional thanks are given to the many readers who provided valuable feedback: Vahid Balazadeh, Michael Cooper, Edward De Brouwer, Aslesha Pokhrel, Adnan Mohd, Ian Shi, Asic Chen and Stephan Rabanser.

Appendix

We observe that naively minimizing the negative cross-entropy results in an unbounded local minimum, while the DCE is significantly more stable and scales similarly to the regular cross-entropy. Gradients are overlaid to help better visualize the 3D geometry.

The DCE averages the cross-entropy term over all classes except t. This expression has the effect of minimizing the predicted probability of the target class ŷ_t while maximizing the overall entropy of the prediction. We note that if N = 2, as in binary classification, then Equation 48 falls back to simple label flipping. Furthermore, we may let ŷ = Softmax(l) for some real-valued N-dimensional logit vector l.

When learning constrained disagreement classifiers using DCE, we minimize the loss function in Equation 3, which applies the regular cross-entropy to the in-distribution samples P and the DCE to the possibly shifted target samples Q. Losses are combined with a weighted sum whose weight should be set to 1/(|Q| + 1), as discussed in section 3 (Choosing λ).

Validity of the DCE Measure. We introduce a definition of a valid disagreement loss function and show that the DCE loss satisfies it.

Definition 1 (A Valid Disagreement Loss Function). Let P be the set of all probability vectors of length N whose maximum index is a fixed target label y ∈ {1, . . . , N}. Similarly, let Q be the set of all probability vectors whose maximum index is not uniquely equal to y (for instance, in the three-dimensional case with y = 3). A valid disagreement loss is one for which, for every probability vector in P, there exists a probability vector in Q that achieves a score at least as low.

Theorem 3. The Disagreement Cross Entropy Loss is a valid disagreement loss function.

Proof. We show that the minimum DCE for a probability vector q ∈ Q is log(N − 1), while the minimum DCE for p ∈ P is log(N).
First note that by definition of the DCE in Equation 48, the unique probability vector v that globally minimizes it with respect to a target class y is the vector

Published as a conference paper at ICLR 2023

UCI Heart Disease. The UCI Heart Disease (UCI-HD) dataset [Janosi et al., 1988] consists of 76 attributes collected from four unique patient databases in Cleveland, Hungary, Switzerland, and the VA Long Beach. We select nine features out of the commonly used 14 to minimize the proportion of missing values. These features are {age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina}. The prediction task is to determine the diagnosis of heart disease (also known as angiographic disease status), which is given in a range from 0-4, where 0 indicates healthy and 1-4 indicates a severity level based on the narrowing of major blood vessels. Following prior work [Chaki et al., 2015], we only consider the simplified binary classification task of differentiating patients with a normal angiographic status (label of 0) from those with an abnormal status (label > 0). We select the source domain as the Cleveland and Hungary databases and the target domain as the Switzerland and VA Long Beach databases. A graphical overview of the marginal feature distributions for the source and target domains is shown in Figure 12.

To allow for out-of-the-box training of deep neural networks on the UCI HD dataset, we use the missing value synthesis functionality in Wolfram Mathematica [Wolfram Research, Inc.]. The algorithm uses density estimation and mode finding on conditioned distributions to synthesize missing values. See the language guide page titled Synthesize Missing Values in Numeric Data for a more detailed description.
To aid in future research, we provide a copy of our processed dataset in <our github repo>/data/uci heart.pt. A summary of the three datasets used, as well as a description of the shifts and their effects on model performance, is provided in Table 4.
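A minimal sketch of the DCE loss discussed above is given below. It assumes the form −(1/(N−1)) Σ_{i≠y} log ŷ_i, which is consistent with the properties stated in the text (label flipping at N = 2, minimum log(N − 1) over Q, and minimum log(N) over P); the function name is ours.

```python
import math

def dce_loss(probs, target):
    """Disagreement cross-entropy (sketch): the cross-entropy averaged over
    every class except `target`, which lowers the target probability while
    keeping the prediction high-entropy."""
    n = len(probs)
    return -sum(math.log(p) for i, p in enumerate(probs) if i != target) / (n - 1)
```

For N = 2 this reduces to the ordinary cross-entropy against the flipped label, and it attains its minimum log(N − 1) at the uniform distribution over the non-target classes.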

G EXPANDING ON GENERALIZATION REGIONS

We expand on the concept of the generalization region R proposed in section 3. We highlight that not all covariate shifts will be harmful to the performance of a model, as in many cases the model will generalize to a region more extensive than the support of the training distribution. We informally refer to this region as R and note that characterizing it precisely is intractable, as it depends on the complex interaction between a model, dataset, and learning algorithm.

Figure 14: Investigating the relation between CDC model complexity and the ability to identify covariate shift. We consider a toy example where the ground truth labels are generated using a quadratic decision boundary, shown as a black dashed line. The blue points correspond to training samples, and the orange/green points correspond to two different covariate shifts: one (green) closer to the training distribution and the other (orange) further. (Left) When we choose an overly simplified model family (e.g., linear classifiers), there exists no CDC that reports different explanations for the orange and green points. (Center) When we choose a quadratic function family, there exists enough variation within the space of models that explain the training set to offer different explanations on the distant shift (orange) but not on the near shift (green). (Right) When we learn from an overly expressive function family (polynomials of degree 3+), the space of models that explain the training set can offer different explanations of even near covariate shifts (green points).

In general, choosing a model family that is well specified for a given task is nontrivial and still an open question in machine learning. This concern is somewhat diminished in cases where models are over-parameterized (e.g., deep neural networks) or built using expert knowledge related to the true underlying mechanism, as is often the case in practical machine learning problems.

