ADAPTATION TO LABEL-SHIFT IN THE PRESENCE OF CONDITIONAL-SHIFT

Abstract

We consider an out-of-distribution setting where trained predictive models are deployed online in new locations (inducing conditional-shift), such that these locations are also associated with differently skewed target distributions (label-shift). While approaches for online adaptation to label-shift have recently been discussed by Wu et al. (2021), the potential presence of concurrent conditional-shift has not been considered in the literature, although one might anticipate such distributional shifts in realistic deployments. In this paper, we empirically explore the effectiveness of online adaptation methods in such situations on three synthetic and two realistic datasets, comprising both classification and regression problems. We show that it is possible to improve performance in these settings by learning additional hyper-parameters to account for the presence of conditional-shift using appropriate validation sets.

1. INTRODUCTION

We consider a setting where we have black-box access to a predictive model that we wish to deploy online in different places with skewed label distributions. For example, such situations can arise when a cloud-based, proprietary service trained on large, private datasets (like Google's Vision APIs) serves several clients in real time in different locations. Every new deployment can be associated with label-shift. Recently, Wu et al. (2021) discussed the problem of online adaptation to label-shift, proposing two variants based on classical adaptation strategies: Online Gradient Descent (OGD) and Follow The History (FTH). Adapting the output of a model to a new label distribution without an accompanying change in the label-conditioned input distribution only requires an adjustment to the predictive distribution (in principle). Therefore, both methods lend themselves to online black-box adaptation to label-shift, which makes on-device, post-hoc adjustments to the predictive distribution feasible under resource constraints. In this paper, we empirically explore such methods when the underlying assumption of an invariant conditional distribution is broken. Such situations are likely to arise in practice. For example, in healthcare settings there are often differing rates of disease incidence (label-shift) across different regions (Vos et al., 2020), accompanied by conditional-shift in input features at different deployment locations, for example in diagnostic radiology (Cohen et al., 2021). In notation, for input variable $x$ and target variable $y$, we have that $P_{new}(x \mid y) \neq P(x \mid y)$ and $P_{new}(y) \neq P(y)$, for a training distribution $P$ and a test distribution $P_{new}$ in a new deployment location.

Contributions Our contributions are as follows.

• We conduct an empirical study of the FTH and OGD methods introduced by Wu et al. (2021) in black-box label-shift settings with concurrent conditional-shift, a situation likely to arise in realistic deployments.
• We explore the question of how to potentially improve performance in such practical settings by computing confusion matrices on OOD validation sets, and show that adding extra hyper-parameters can contribute to further improvements.

• We reinterpret a simplified variant of FTH under a more general Bayesian perspective, enabling us to develop an analogous baseline for online adaptation in regression problems.

2. BACKGROUND

We begin with a brief review of online adaptation methods for label-shift in classification problems, based on the recent discussion in Wu et al. (2021). While their motivation is temporal drift in label distributions, we consider the case where a single model serves several clients online in different locations, each with its own skewed label distribution that remains fixed over time. If the training-set label distribution is $P(y)$ and the label distribution in the new location is $P_{new}(y)$, and if we assume $P_{new}(x \mid y) = P(x \mid y)$, then the following holds:

$$P_{new}(y \mid x) = \frac{P(x \mid y) \, P_{new}(y)}{P_{new}(x)} = P(y \mid x) \, \frac{P(x)}{P(y)} \, \frac{P_{new}(y)}{P_{new}(x)} \propto \frac{P_{new}(y)}{P(y)} \, P(y \mid x), \qquad (1)$$

i.e., the location-adjusted output distribution is simply a reweighting of the output distribution from the underlying base predictive model. Wu et al. (2021) follow past work on label-shift adaptation by restricting the hypothesis space for $f$ to be that of re-weighted classifiers, since Eq. 1 implies that one only needs to re-weight the predictive distribution to account for label-shift. The parameter vector for this classifier is simply the vector of probabilities in $P_{new}(y)$, henceforth referred to as $p$; we will similarly use $q$ to represent the training-set label distribution, $P(y)$.
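The identity in Eq. 1 can be verified numerically on a small discrete example. The sketch below is purely illustrative (the two-class, two-input toy distributions are our own choices): it builds a joint distribution, changes only the label marginal while holding $P(x \mid y)$ fixed, and checks that reweighting the training posterior by $P_{new}(y)/P(y)$ and renormalizing recovers the true new posterior.

```python
import numpy as np

# Toy discrete check of Eq. 1: with P_new(x|y) = P(x|y), reweighting the
# training posterior by P_new(y)/P(y) recovers the true new posterior.
P_x_given_y = np.array([[0.7, 0.3],    # P(x | y=0) over two x-values
                        [0.2, 0.8]])   # P(x | y=1)
q = np.array([0.5, 0.5])               # training label distribution P(y)
p = np.array([0.9, 0.1])               # deployment distribution P_new(y)

def posterior(prior):
    """P(y | x) for each x-value, given a label prior; columns index x."""
    joint = P_x_given_y * prior[:, None]   # P(x, y) = P(x | y) P(y)
    return joint / joint.sum(axis=0)       # normalize over y for each x

post_train = posterior(q)
post_new_true = posterior(p)

# Reweight each row y by p[y]/q[y], then renormalize each column (Eq. 1).
post_new_reweighted = post_train * (p / q)[:, None]
post_new_reweighted /= post_new_reweighted.sum(axis=0)

assert np.allclose(post_new_true, post_new_reweighted)
```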
Given an underlying predictive model $f$, the adjusted classifier rule is therefore given by

$$g(x; f, q, p) = \arg\max_{y \in [K]} \frac{p[y]}{q[y]} \, P_f(y \mid x), \qquad (2)$$

where $P_f(y \mid x)$ is the predictive distribution produced by the underlying base model $f$ (for example, a softmax distribution produced by a neural network) and $K$ is the number of classes in our dataset.
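The re-weighted decision rule in Eq. 2 amounts to a few lines of array arithmetic, which is what makes it attractive in the black-box setting: only the softmax output of the base model is needed. A minimal sketch, with all numbers our own illustrative choices:

```python
import numpy as np

def reweighted_predict(p_f, q, p):
    """Adjust a base model's predictive distribution for label-shift (Eq. 2).

    p_f : (K,) softmax output of the base model for one input x
    q   : (K,) training-set label distribution P(y)
    p   : (K,) estimated deployment label distribution P_new(y)
    Returns the adjusted distribution and the arg-max prediction.
    """
    w = p_f * p / q            # element-wise re-weighting (Eq. 1)
    w = w / w.sum()            # renormalize to a probability vector
    return w, int(np.argmax(w))

# Toy example: a 3-class model confident in class 0, deployed where
# class 2 is much more common than in training.
p_f = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
p = np.array([0.1, 0.2, 0.7])
adjusted, y_hat = reweighted_predict(p_f, q, p)
# The deployment prior flips the decision from class 0 to class 2.
```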

2.1. ONLINE ADAPTATION ALGORITHMS

Wu et al. (2021) present two online updating methods to estimate $p$: Online Gradient Descent (OGD) and Follow The History (FTH). If we assume knowledge of a confusion matrix for a classifier $f$ in a new location, Wu et al. (2021) show that the expected error rate in this new location can be derived as a function of the label distribution $P_{new}(y)$. Represent $P_{new}(y)$ as a $K$-dimensional probability vector $q_{new}$, and let $C_f^{new} \in \mathbb{R}^{K \times K}$ be the confusion matrix in the new location, with $C_f^{new}[i, j] = P_{x \sim P_{new}(x \mid y=i)}(f(x) = j)$. Then

$$\ell^{new}(f) = \sum_{i=1}^{K} \left(1 - P_{x \sim P_{new}(x \mid y=i)}(f(x) = i)\right) q_{new}[i] = \langle \mathbf{1} - \mathrm{diag}(C_f^{new}), \, q_{new} \rangle, \qquad (3)$$

where $\mathbf{1}$ is the all-ones vector. Since we have assumed no conditional-shift so far, $C_f^{new} = C_f$, i.e., the confusion matrix remains invariant under label-shift. This implies that one can optimize the expected error rate in the new deployment location using a confusion matrix estimated from a large in-distribution validation set, $C_f$, in place of $C_f^{new}$ in Eq. 3.

Online Gradient Descent (OGD) Assuming that $\mathrm{diag}(C_f)$ is differentiable w.r.t. $f$, we can update $f$ to minimize the expected error rate. We would typically not know the true label distribution in the new deployment location. However, when the confusion matrix $C_f$ is invertible, we can compute an unbiased estimate of this distribution, given as

$$\hat{q}_{new} = (C_f^\top)^{-1} e,$$

where $e$ is a one-hot vector for the predicted category. Using this, Wu et al. (2021) present an unbiased gradient of $\ell^{new}(f)$:

$$\nabla_f \ell^{new}(f) = \mathbb{E}_{P_{new}}\left[\frac{\partial}{\partial f}\left[\mathbf{1} - \mathrm{diag}(C_f)\right]^\top \cdot \hat{q}_{new}\right]. \qquad (4)$$

When the hypothesis space is restricted to the space of re-weighted classifiers $g$ (Eq. 2), this gradient is only over $p$. Wu et al. (2021) show how effective numerical methods can be used to estimate this gradient.
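Both the expected error rate of Eq. 3 and the unbiased estimate $(C_f^\top)^{-1} e$ are cheap to compute given an estimated confusion matrix. A minimal sketch, with an illustrative 2-class confusion matrix of our own choosing; note that the per-sample estimate always sums to one (rows of $C_f$ sum to one) but its entries need not be non-negative:

```python
import numpy as np

def expected_error(C, q_new):
    """Expected error rate in the new location, <1 - diag(C), q_new> (Eq. 3)."""
    return float((1.0 - np.diag(C)) @ q_new)

def unbiased_label_estimate(C, y_pred, K):
    """Unbiased one-sample estimate (C^T)^{-1} e of P_new(y).

    C[i, j] = P(f(x) = j | y = i). Rows sum to one, so the estimate
    sums to one, though individual entries may be negative.
    """
    e = np.zeros(K)
    e[y_pred] = 1.0
    return np.linalg.inv(C.T) @ e

# Illustrative 2-class confusion matrix and deployment label distribution.
C = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q_new = np.array([0.3, 0.7])

err = expected_error(C, q_new)   # 0.1 * 0.3 + 0.2 * 0.7 = 0.17

# Averaging the per-prediction estimates over the induced distribution
# of predictions, P_new(f(x)) = C^T q_new, recovers q_new exactly.
pred_probs = C.T @ q_new
avg_estimate = sum(pred_probs[j] * unbiased_label_estimate(C, j, 2)
                   for j in range(2))
```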
In the online setting, $p$ is updated after seeing new examples; hence the $(t+1)$-th gradient update is performed by computing the gradient at the current point $p_t$, followed by a projection onto the probability simplex:

$$\nabla_p \ell^{new}(p)\Big|_{p=p_t} = \mathbb{E}_{P_{new}}\left[\frac{\partial}{\partial p}\left[\mathbf{1} - \mathrm{diag}(C_g)\right]^\top \cdot \hat{q}_{new}\right]\Bigg|_{p=p_t}. \qquad (5)$$
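The projected update can be sketched as follows. This is not the authors' implementation: the simplex projection is the standard sorting-based Euclidean projection, the step size is an illustrative choice, `grad_estimate` stands in for the stochastic gradient of Eq. 5, and `fth_estimate` is our rendering of the FTH baseline as a projected running mean of the unbiased label estimates.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex
    (standard sorting-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def ogd_step(p_t, grad_estimate, lr=0.05):
    """One projected-gradient update of p (Eq. 5); lr is an illustrative
    step size and grad_estimate a stochastic gradient at p_t."""
    return project_simplex(p_t - lr * grad_estimate)

def fth_estimate(q_hats):
    """Follow The History baseline (our sketch): project the running mean
    of the unbiased label estimates onto the simplex."""
    return project_simplex(np.mean(q_hats, axis=0))
```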

