ADAPTATION TO LABEL-SHIFT IN THE PRESENCE OF CONDITIONAL-SHIFT

Abstract

We consider an out-of-distribution setting where trained predictive models are deployed online in new locations (inducing conditional-shift), such that these locations are also associated with differently skewed target distributions (label-shift). While approaches for online adaptation to label-shift have recently been discussed by Wu et al. (2021), the potential presence of concurrent conditional-shift has not been considered in the literature, although one might anticipate such distributional shifts in realistic deployments. In this paper, we empirically explore the effectiveness of online adaptation methods in such situations on three synthetic and two realistic datasets, comprising both classification and regression problems. We show that it is possible to improve performance in these settings by learning additional hyper-parameters that account for the presence of conditional-shift, using appropriate validation sets.

1. INTRODUCTION

We consider a setting where we have black-box access to a predictive model which we are interested in deploying online in different places with skewed label distributions. Such situations can arise, for example, when a cloud-based proprietary service trained on large, private datasets (like Google's Vision APIs) serves several clients in real time in different locations. Every new deployment can be associated with label-shift. Recently, Wu et al. (2021) discussed the problem of online adaptation to label-shift, proposing two variants based on classical adaptation strategies: Online Gradient Descent (OGD) and Follow The History (FTH). Adapting the output of a model to a new label-distribution without an accompanying change in the label-conditioned input distribution only requires an adjustment to the predictive distribution (in principle). Both methods therefore lend themselves to online black-box adaptation to label-shift, which makes on-device, post-hoc adjustments to the predictive distribution feasible under resource constraints.
In this paper, we empirically explore such methods when the underlying assumption of an invariant conditional distribution is broken. Such situations are likely to arise in practice. For example, in healthcare settings there are often differing rates of disease incidence (label-shift) across regions (Vos et al., 2020), accompanied by conditional-shift in the input features at different deployment locations, for example in diagnostic radiology (Cohen et al., 2021). In notation, for input variable $x$ and target variable $y$, we have $P_{new}(x \mid y) \neq P(x \mid y)$ and $P_{new}(y) \neq P(y)$, for a training distribution $P$ and a test distribution $P_{new}$ in a new deployment location.

Contributions. Our contributions are as follows.
• We conduct an empirical study of the FTH and OGD methods introduced by Wu et al. (2021) in black-box label-shift settings with concurrent conditional-shift, a situation likely to arise in realistic deployments.
• We explore how to improve performance in such practical settings by computing confusion matrices on OOD validation sets, and show that adding extra hyper-parameters can contribute to further improvements.
• We reinterpret a simplified variant of FTH under a more general Bayesian perspective, enabling us to develop an analogous baseline for online adaptation in regression problems.

2. BACKGROUND

We begin with a brief review of online adaptation methods for label-shift in classification problems, based on the recent discussion in Wu et al. (2021). While their motivation is temporal drift in label-distributions, we consider the case where a single model serves several clients online in different locations, each with its own skewed label-distribution that does not change further with time. If the training-set label-distribution is $P(y)$ and the label-distribution in the new location is $P_{new}(y)$, and if we assume $P_{new}(x \mid y) = P(x \mid y)$, then the following holds:

$$P_{new}(y \mid x) = \frac{P(x \mid y)\, P_{new}(y)}{P_{new}(x)} = \frac{P(y \mid x)\, P(x)}{P(y)} \cdot \frac{P_{new}(y)}{P_{new}(x)} \propto \frac{P_{new}(y)}{P(y)}\, P(y \mid x), \quad (1)$$

i.e., the location-adjusted output distribution is simply a reweighting of the output distribution from the underlying base predictive model. Wu et al. (2021) follow past work on label-shift adaptation by restricting the hypothesis space for $f$ to that of re-weighted classifiers, since Eq. 1 implies that one only needs to re-weight the predictive distribution to account for label-shift. The parameter vector for this classifier is simply the vector of probabilities in $P_{new}(y)$, henceforth referred to as $p$, and we will similarly use $q$ to represent the training-set label distribution $P(y)$. Given an underlying predictive model $f$, the adjusted classification rule is therefore

$$g(x; f, q, p) = \arg\max_{y \in [K]} \frac{p[y]}{q[y]}\, P_f(y \mid x), \quad (2)$$

where $P_f(y \mid x)$ is the predictive distribution produced by the base model $f$ (for example, a softmax distribution produced by a neural network) and $K$ is the number of classes in our dataset.

2.1. ONLINE ADAPTATION ALGORITHMS

Wu et al. (2021) present two online updating methods to estimate $p$: Online Gradient Descent (OGD) and Follow The History (FTH). Assume knowledge of a confusion matrix for a classifier $f$ in a new location, $C_f^{new} \in \mathbb{R}^{K \times K}$, such that $C_f^{new}[i, j] = P_{x \sim P_{new}(x \mid y=i)}(f(x) = j)$. Then Wu et al.
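As a concrete illustration, the re-weighted rule of Eq. 2 amounts to a few lines of NumPy. The function name and toy numbers below are our own and purely illustrative:

```python
import numpy as np

def reweighted_predict(p_f, p_new, q_train):
    """Eq. 2: re-weight the base model's predictive distribution p_f
    by the ratio of new-location prior to training-set prior."""
    scores = p_f * p_new / q_train
    return int(np.argmax(scores))

# Toy example: the base model slightly favours class 0, but the
# new location's label prior is heavily skewed towards class 1.
p_f = np.array([0.55, 0.45])      # P_f(y | x)
q_train = np.array([0.5, 0.5])    # P(y) on the training set
p_new = np.array([0.1, 0.9])      # estimated P_new(y)
print(reweighted_predict(p_f, p_new, q_train))  # → 1
```

With a flat `p_new` the rule reduces to the base model's argmax, so the adjustment is entirely driven by the prior ratio.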
(2021) show that the expected error rate in this new location can be derived as a function of the label-distribution $P_{new}(y)$. If we represent $P_{new}(y)$ as a $K$-dimensional probability vector $q_{new}$, the expected error rate is

$$\ell_{new}(f) = \sum_{i=1}^{K} \left[ 1 - P_{x \sim P_{new}(x \mid y=i)}(f(x) = i) \right] \cdot q_{new}[i] = \langle \mathbf{1} - \mathrm{diag}(C_f^{new}),\; q_{new} \rangle, \quad (3)$$

where $\mathbf{1}$ is the all-ones vector. Since we have assumed no conditional-shift so far, $C_f^{new} = C_f$, i.e. the confusion matrix remains invariant under label-shift. This implies one can optimize the expected error rate in the new deployment location using a confusion matrix estimated from a large in-distribution validation set, $C_f$, in place of $C_f^{new}$ in Eq. 3.

Online Gradient Descent (OGD). Assuming that $\mathrm{diag}(C_f)$ is differentiable wrt $f$, we can update $f$ to minimize the expected error rate. We would typically not be aware of the true label-distribution in the new deployment location. However, when the confusion matrix $C_f$ is invertible, we can compute an unbiased estimate of this distribution, given as $\hat{q}_{new} = (C_f^\top)^{-1} e$, where $e$ is a one-hot vector for the predicted category. Using this, Wu et al. (2021) present an unbiased gradient of $\ell_{new}(f)$,

$$\nabla_f \ell_{new}(f) = \mathbb{E}_{P_{new}} \left[ \frac{\partial}{\partial f} \left[ \mathbf{1} - \mathrm{diag}(C_f) \right]^\top \cdot \hat{q}_{new} \right]. \quad (4)$$

When the hypothesis space is restricted to the space of re-weighted classifiers $g$ (Eq. 2), this gradient is only over $p$. Wu et al. (2021) show how effective numerical methods can be used to estimate this gradient. In the online setting, $p$ is updated after seeing new examples; the $(t+1)$-th update is performed by computing the gradient at the current point $p_t$, followed by a projection onto the probability simplex,

$$\nabla_p \ell_{new}(p) \big|_{p=p_t} = \mathbb{E}_{P_{new}} \left[ \frac{\partial}{\partial p} \left[ \mathbf{1} - \mathrm{diag}(C_g) \right]^\top \cdot \hat{q}_{new} \right] \bigg|_{p=p_t}, \quad (5)$$

$$p_{t+1} = \mathrm{Proj}_{\Delta^{K-1}} \left( p_t - \eta \cdot \nabla_p \ell_{new}(p) \big|_{p=p_t} \right), \quad (6)$$

where $\eta$ is the learning rate and $\mathrm{Proj}$ is the projection operator.
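The projection step of the OGD update can be implemented with the standard sort-and-threshold Euclidean projection onto the probability simplex. The sketch below is our own, assuming Euclidean projection is the intended Proj operator, and treats the gradient as a given input:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (standard sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]                     # sort descending
    css = np.cumsum(u)
    # largest index where the thresholded coordinate stays positive
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def ogd_step(p_t, grad, lr):
    """One OGD update on p (Eq. 6): gradient step, then projection."""
    return project_simplex(p_t - lr * grad)

p = ogd_step(np.array([0.5, 0.3, 0.2]),
             grad=np.array([1.0, -1.0, 0.0]), lr=0.1)
# p is again a valid distribution: non-negative, sums to 1
```

The projection guarantees that $p_t$ remains a valid distribution regardless of the (possibly noisy) gradient estimate.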
Follow The History (FTH). The update rule for $p_t$ in FTH is simpler and more efficient (in terms of memory and time complexity), given by

$$p_{t+1} = \frac{1}{t} \sum_{\tau=1}^{t} \hat{q}_{new}^{\,\tau}, \quad (7)$$

where $\hat{q}_{new}^{\,\tau}$ is the estimate of the label distribution at the $\tau$-th iteration. Empirical evidence in Wu et al. (2021) suggests that FTH performs very competitively with OGD, and might be preferred in highly resource-constrained settings.
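A minimal sketch of the full FTH loop, combining the re-weighted prediction (Eq. 2), the unbiased estimate $(C_f^\top)^{-1} e$, and the running average of Eq. 7. The clipping back to a valid distribution is our own choice for handling the occasional negative entries of the unbiased estimates:

```python
import numpy as np

def fth_adapt(probs_stream, C, q_train, eps=1e-12):
    """Follow The History (Eq. 7): keep the running average of the
    unbiased per-step label-distribution estimates (C^T)^{-1} e."""
    K = C.shape[0]
    Cinv_T = np.linalg.inv(C.T)
    p = np.full(K, 1.0 / K)            # start from a uniform estimate
    preds, q_sum = [], np.zeros(K)
    for t, p_f in enumerate(probs_stream, start=1):
        # re-weighted prediction with the current estimate p (Eq. 2)
        y_hat = int(np.argmax(p_f * p / q_train))
        preds.append(y_hat)
        # unbiased estimate of P_new(y) from the one-hot prediction
        e = np.zeros(K)
        e[y_hat] = 1.0
        q_sum += Cinv_T @ e
        # running average, clipped back to a valid distribution
        p = np.clip(q_sum / t, eps, None)
        p /= p.sum()
    return preds, p

# toy stream: a 2-class model that consistently favours class 1
C = np.array([[0.9, 0.1], [0.1, 0.9]])   # in-distribution confusion matrix
stream = [np.array([0.2, 0.8]) for _ in range(20)]
preds, p = fth_adapt(stream, C, q_train=np.array([0.5, 0.5]))
```

On this toy stream the estimate $p$ quickly concentrates on class 1, and the adjusted predictions follow.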

3. UNMET ASSUMPTIONS IN PRACTICE

We now consider applying the above strategies in cases where some of the assumptions of the previous section are broken. While it is difficult to make conclusive theoretical statements in such situations, we propose some heuristics which we evaluate empirically.

3.1. THE ASSUMPTION OF INVARIANT P (x | y) CAN BREAK

In realistic deployments in new locations, it is likely that along with a differently skewed label-distribution, the conditional distribution will change as well, i.e. $P_{new}(x \mid y) \neq P(x \mid y)$. In our study, we assume that this distributional shift takes place only within the same domain, and along (potentially spuriously-correlated) non-semantic features, leaving the semantic features intact, a setting likely to be manifested in different deployment locations.

HEURISTIC 1

One way to adapt the above methods to settings with concurrent conditional-shift is to estimate the confusion matrix on an OOD validation set. Intuitively, an IID-estimated confusion matrix is likely to be over-confident, and a surrogate OOD validation set can better reflect performance in test-time OOD settings.

HEURISTIC 2

We propose to add extra scaling hyper-parameters to the decision rule in Eq. 2. Specifically, we add the scaling hyper-parameters $\lambda_u$ and $\lambda_y$ before making a test prediction,

$$g(x; f, q, p) = \arg\max_{y \in [K]} \; \log P_f(y \mid x) + \lambda_u \log p[y] - \lambda_y \log q[y], \quad (8)$$

where we have rewritten the rule in log-space. In this formulation, $\log P_f(y \mid x) = \mathrm{logit}[y] - Z(x)$, so we can drop the normalizing term. This results in a predictive rule that is a form of logit-adjustment (Menon et al., 2021). Intuitively, these hyper-parameters determine how much of the training prior to "subtract", and how much weight to assign to the pseudo-label based re-adjustment. When their magnitudes are learned on validation sets representing a combination of label-shift and conditional-shift, one can hope to further improve performance at novel test-time deployments.
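In code, Eq. 8 is a small logit adjustment. The toy logits and priors below are illustrative, and show how $\lambda_y$ controls how much of the training prior is subtracted:

```python
import numpy as np

def adjusted_predict(logits, p_new, q_train, lam_u=1.0, lam_y=1.0):
    """Log-space decision rule of Eq. 8 (a form of logit adjustment).
    lam_u and lam_y are the scaling hyper-parameters of Heuristic 2;
    the normalizer Z(x) is constant in y and can be dropped."""
    scores = logits + lam_u * np.log(p_new) - lam_y * np.log(q_train)
    return int(np.argmax(scores))

logits = np.array([2.0, 1.8])      # base model slightly prefers class 0
q_train = np.array([0.9, 0.1])     # training prior skewed to class 0
p_new = np.array([0.5, 0.5])       # flat estimated prior in the new location
# with lam_y = 0 the training prior is kept; with lam_y = 1 it is
# fully subtracted, flipping the decision toward the rare class
print(adjusted_predict(logits, p_new, q_train, lam_y=0.0))  # → 0
print(adjusted_predict(logits, p_new, q_train, lam_y=1.0))  # → 1
```

Intermediate values of $\lambda_y$ (learned on an OOD validation set) interpolate between the two behaviours.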

3.2. CONFUSION MATRICES CAN BE NON-INVERTIBLE

Existing work on label-shift adaptation based on confusion matrices relies on a sufficiently large held-out validation set to estimate a robust confusion matrix. When the underlying dataset is highly class-imbalanced, with many categories and limited-size validation sets, one can easily end up with a non-invertible confusion matrix. Lipton et al. (2018) suggest two main alternatives: a soft confusion matrix, or a pseudo-inverse. In our experiments on a large-scale realistic dataset, we find both choices to lead to degraded performance, while simply using an identity-matrix approximation can recover some of the lost performance (see Appendix E). When using FTH with an identity $C_f$, this corresponds to simply using the pseudo-labels up to time $t$ to estimate the label-distribution. However, naively using the identity matrix in Eq. 7 leads to a practical problem: after seeing the first data-point, $p$ would be a one-hot vector, and would thus enforce the same prediction at the next iteration when using Eq. 2. A fix is to use a "pseudo-count" to smooth initial conditions, which is reminiscent of Bayesian posterior updates. In the next section, we use this realization as a starting point to suggest a simpler and more general framework, which then enables us to develop an equivalent online label-shift adaptation method for regression problems.

4. A BAYESIAN PERSPECTIVE

If we use the vector $\alpha$ to keep online counts of predictions, with initialization $\alpha_0$, such that

$$\alpha_t[k] = \sum_{\tau=1}^{t} \mathbb{1}[\hat{y}_\tau = k] + \alpha_0 = \mathbb{1}[\hat{y}_t = k] + \alpha_{t-1}[k], \quad (9)$$

then using an identity confusion matrix in Eq. 7 corresponds to the following update rule,

$$p_{t+1}[k] = \frac{\alpha_t[k]}{\sum_{k'=1}^{K} \alpha_t[k']}. \quad (10)$$

We recognize that this update rule corresponds exactly to the posterior predictive distribution computed using a Categorical likelihood with a Dirichlet prior, together with a recursive rule for updating the posterior. More precisely, if we use

$$\phi \sim \mathrm{Dir}(\alpha), \quad (11)$$
$$y \mid \phi \sim \mathrm{Cat}(\phi), \quad (12)$$

where $\phi \in \Delta^{K-1}$ are the parameters of the Categorical distribution, in the following update equations

$$P_t(\phi) \propto P(y_t \mid \phi)\, P_{t-1}(\phi), \quad (13)$$
$$P_{t+1}(y) = \int_\phi P(y \mid \phi)\, P_t(\phi)\, d\phi, \quad (14)$$

then we arrive at Eq. 10 using Eq. 14, and Eq. 9 using Eq. 13. See Appendix A for a derivation of Eq. 13. In practice, $y_t$ is not available to us, and we use the pseudo-label $\hat{y}_t$ instead, as in FTH.
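The Dirichlet-Categorical updates of Eqs. 9-10 amount to maintaining smoothed counts. A small sketch (the class name and defaults are ours), illustrating how a positive pseudo-count $\alpha_0$ avoids the one-hot lock-in discussed in Sec. 3.2:

```python
import numpy as np

class DirichletTracker:
    """Online Dirichlet-Categorical posterior predictive (Eqs. 9-10):
    p_{t+1}[k] = alpha_t[k] / sum(alpha_t), with a uniform
    pseudo-count alpha_0 acting as the smoothing prior."""
    def __init__(self, K, alpha0=1.0):
        self.alpha = np.full(K, alpha0)

    def update(self, y_hat):
        # Eq. 9: increment the count of the (pseudo-)label
        self.alpha[y_hat] += 1.0
        return self.predictive()

    def predictive(self):
        # Eq. 10: normalized counts give the posterior predictive
        return self.alpha / self.alpha.sum()

tracker = DirichletTracker(K=3, alpha0=1.0)
p1 = tracker.update(0)
# after one observation of class 0, the estimate is smoothed rather
# than the degenerate one-hot vector that alpha0 = 0 would give
print(p1)  # → [0.5, 0.25, 0.25]
```

With `alpha0 = 0` the first update would produce a one-hot $p$, reproducing the failure mode noted in Sec. 3.2.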

4.1. EXTENSION TO REGRESSION PROBLEMS

While adaptation for regression problems has been discussed more generally (Cortes & Mohri, 2011; 2014; Zhang et al., 2013), an analogous discussion of online black-box label-shift adaptation is missing for regression. We adapt the general online update rules in Eqs. 13-14 to regression problems undergoing similar concurrent test-time distributional shifts. A natural choice is to use Gaussians to model the distributions over the continuous target variable,

$$P_f(y \mid x) \propto \exp\left( -\frac{\lambda_x}{2} (y - f(x))^2 \right), \qquad P(y) \propto \exp\left( -\frac{\lambda_y}{2} (y - m)^2 \right), \quad (15)$$

where $\lambda_x, \lambda_y$ are the precision parameters and $m$ is the training-set mean. The parameters $\phi$ in Eq. 13 are now the mean and precision parameters of $y$ in the new deployment location. We use the Normal-Gamma distribution to model the posterior over these parameters, since it is the conjugate distribution for Gaussians with unknown mean and precision (DeGroot, 2004),

$$P(\mu_{new}, \lambda_{new}) = \mathcal{N}\left( \mu_{new} \,\middle|\, \mu, \frac{1}{\kappa \lambda_{new}} \right) \mathrm{Ga}(\lambda_{new} \mid a, b). \quad (16)$$

Combined with the Gaussian likelihood via Eq. 14, this yields $P_{new}(y)$ in the form of a Student's t-distribution,

$$P_{new}(y) \propto \left( 1 + \frac{L}{2a} (y - \mu)^2 \right)^{-\frac{2a+1}{2}}, \quad (17)$$

where $2a$ is the number of degrees of freedom and $L = \frac{a\kappa}{b(\kappa+1)}$. Using these, our predictive function (in log-space) takes the form

$$\arg\min_y \; \frac{\lambda_x}{2} (y - f(x))^2 - \frac{\lambda_y}{2} (y - m)^2 + \frac{2a+1}{2} \log\left( 1 + \frac{L}{2a} (y - \mu)^2 \right). \quad (18)$$

Setting the derivative wrt $y$ to zero yields a cubic equation (see Appendix B.1), which we can solve to find roots. A positive sign of the second derivative of the objective tells us whether a solution is a (local) minimum. When there is one real solution with a positive second derivative, we use it; when there are multiple real solutions with positive second derivatives, we pick the one with the smallest objective; when there are no real solutions with positive second derivatives, we do not update $P(y \mid x)$, retaining $f(x)$ as the solution.
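The root-selection procedure just described can be sketched as follows, using NumPy's polynomial solver for the cubic and the closed-form second derivative (Appendix B.2) as the curvature test. The function name and toy values are our own, purely for illustration:

```python
import numpy as np

def adjusted_regression_output(fx, m, mu, a, L, lam_x, lam_y):
    """Solve the cubic stationarity condition of Eq. 18 and pick the
    minimizer, falling back to the base prediction fx when no real
    root is a local minimum (symbols follow the text)."""
    M = L / (2.0 * a)
    A = 2.0 * a + 1.0
    tau_d = lam_x - lam_y
    tau_mu = lam_y * m - lam_x * fx
    # coefficients of the cubic in y (Appendix B.1)
    coeffs = [M * tau_d,
              M * tau_mu - 2.0 * M * mu * tau_d,
              tau_d + M * mu**2 * tau_d - 2.0 * M * mu * tau_mu + A * M,
              tau_mu + M * tau_mu * mu**2 - A * M * mu]
    # objective of Eq. 18 and its second derivative (Appendix B.2)
    J = lambda y: (0.5 * lam_x * (y - fx)**2 - 0.5 * lam_y * (y - m)**2
                   + 0.5 * A * np.log1p(M * (y - mu)**2))
    J2 = lambda y: (tau_d + A * M * (1.0 - M * (y - mu)**2)
                    / (1.0 + M * (y - mu)**2)**2)
    minima = [r.real for r in np.roots(coeffs)
              if abs(r.imag) < 1e-9 and J2(r.real) > 0.0]
    return min(minima, key=J) if minima else fx

# sanity check: with no marginal adjustment (lam_y = 0) and a
# t-distribution centred at fx, the solution stays at fx
y = adjusted_regression_output(fx=1.0, m=0.0, mu=1.0, a=1.0,
                               L=0.1, lam_x=10.0, lam_y=0.0)
```

When the prior and the base prediction disagree, the returned minimizer interpolates between them according to the precisions.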
Empirically, we find that the no-local-minimum condition does not arise for optimal choices of hyper-parameters (see also Appendix B.2). The update equations at the $t$-th step follow from the computation of the posterior using Eq. 13 (see Murphy (2007), for example, for the derivation) and are given by

$$a_{t+1} = a_t + \tfrac{1}{2}; \qquad \kappa_{t+1} = \kappa_t + 1; \qquad \mu_{t+1} = \frac{\kappa_t \mu_t + \hat{y}_{t+1}}{\kappa_t + 1}; \qquad b_{t+1} = b_t + \frac{\kappa_t (\hat{y}_{t+1} - \mu_t)^2}{2(\kappa_t + 1)}.$$

The hyper-parameters $\lambda_x$ (output precision) and $\kappa$ (the equivalent of the smoothing pseudo-count $\alpha_0$ in classification) are picked on the validation set, along with a scaling pre-multiplier for the precision $\lambda_y$ (analogous to the classification setup). In order to place uniform priors over the output range, we simulate a uniform set of samples over the output range: $\mu = \mathbb{E}[y_{pseudo}]$ is the mean of the pseudo-samples, and $b$ is initialized as $0.5(\kappa - 1)\mathrm{Var}(y_{pseudo})$ (see Appendix B.3 for details).
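These recursive updates, together with the pseudo-sample prior initialization, can be sketched as follows (function names and the toy pseudo-label stream are ours):

```python
import numpy as np

def init_prior(y_pseudo, kappa0=2.0, a0=1.0):
    """Prior from uniform pseudo-samples over the output range
    (Appendix B.3): mu is their mean, b = 0.5 (kappa - 1) Var."""
    mu = float(np.mean(y_pseudo))
    b = 0.5 * (kappa0 - 1.0) * float(np.var(y_pseudo))
    return dict(a=a0, kappa=kappa0, mu=mu, b=b)

def ng_update(state, y_hat):
    """One recursive Normal-Gamma posterior update (Sec. 4.1),
    using the pseudo-label y_hat in place of the unobserved target."""
    a, k, mu, b = state["a"], state["kappa"], state["mu"], state["b"]
    return dict(a=a + 0.5,
                kappa=k + 1.0,
                mu=(k * mu + y_hat) / (k + 1.0),
                b=b + k * (y_hat - mu) ** 2 / (2.0 * (k + 1.0)))

# uniform pseudo-samples over an assumed output range [-3, 3]
state = init_prior(np.linspace(-3.0, 3.0, 100))
for y_hat in [1.0, 1.2, 0.9, 1.1]:
    state = ng_update(state, y_hat)
# the posterior mean drifts from 0 towards the observed pseudo-labels
```

Each update is O(1), keeping the method as lightweight as FTH in the classification case.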

5. EXPERIMENTS

We compare variants of online label-shift methods based on the discussion above, on a mix of synthetic and realistic datasets, against the un-adjusted model performance (BASE).
• FTH and OGD: the variants proposed in Wu et al. (2021). We evaluate each for two choices of confusion matrix: computed on the in-distribution validation set, and on the out-of-distribution validation set (our HEURISTIC 1). We refer to these two alternatives as (C-IID) and (C-OOD).
• FTH-H and OGD-H: our modifications of FTH and OGD using the scaling hyper-parameters proposed in HEURISTIC 2. For both variants, we again evaluate two versions each, using (C-IID) and (C-OOD).
• FTH-H-B: our modification of FTH, with an additional pseudo-count hyper-parameter added for smoothing. The hyper-parameters are learned on the OOD validation sets. We call the regression variant FTH-H-B (R).
• OPTIMAL FIXED CLASSIFIERS: oracle methods derived by replacing $p$ in Eq. 2 with the empirical location-wise label distributions, providing a sense of the achievable gains if one were aware of the true label-distributions from the outset. We include two variants: OFC, which uses Eq. 2, and OFC-H, which uses the modified rule in Eq. 8 with oracle hyper-parameters learned on the test set.
When using OGD, we use the surrogate-loss implementation of Wu et al. (2021), since it is both better-performing and much faster. This variant uses a smooth approximation of the 0-1 loss, allowing direct gradient computation instead of a numerical approximation.

5.1.1. SYNTHETIC: SKEWED-MNIST

We split the MNIST classes into two subsets: [0, 1, 2, 5, 9] and [3, 4, 6, 7, 8]. We use different colors to correspond to different deployment locations, similar to Arjovsky et al. (2019). In the training set, we color digits in a particular subset a particular color 99% of the time; this corresponds to a 99% skew in label-distributions across the two locations. The 1% cross-over encourages some color-invariance, but not strongly enough to completely overcome the bias. The validation set uses opposing colors for the subsets, but with a 75% correlation; this represents a scenario where the class-distributions in different locations change from those in training. Finally, the test set uses completely flipped colors in the two subsets compared to the training set, implying reversed label-distributions and resulting in poorer baseline performance.

5.1.2. REALISTIC: WILDS-IWILDCAM

Since Koh et al. (2021) trained ResNet-50 based models along with their curation of this dataset, also evaluating several methods for OOD generalization and releasing all models, we use their models trained with the domain generalization method CORAL (Sun & Saenko, 2016), since this model improves over the ERM baseline. They released three sets of weights, trained with three random seeds. We evaluate all variants for each of the three seeds, with 3 random orderings each of the test set, and report aggregates in Table 1. Koh et al. (2021) recommend evaluation with both average accuracy and macro-F1 (since some species in the dataset are rare). We evaluate with both metrics, but use our own trained models for average accuracy, because Koh et al. (2021) trained their models optimizing for macro-F1. We similarly trained CORAL-augmented base models, optimizing the penalty coefficient and the choice of early stopping. We replace the confusion matrix with an identity matrix for evaluating methods on this dataset (for methods where a validation-set-estimated confusion matrix is required).
Confusion matrices evaluated on the validation sets are non-invertible for this dataset due to sparse class-representation, and we found common alternatives to perform poorly (see Appendix E).

Under review as a conference paper at ICLR 2023

5.2.1. SYNTHETIC: MIX-OF-GAUSSIANS

We create a synthetic regression dataset by constructing a curve from a mixture of Gaussians. We pick regions on the x-axis to correspond to training, validation, and test sets, such that every set samples data from two regions each, corresponding to two locations (see Appendix C.2). In Figure 1b, we depict the curve, along with sampling indicators for the different sets and sources; the points are placed at different heights for clearer visualization of overlaps. 500 points are sampled from the two training regions, and 250 each for the validation and test sets from their assigned regions. We train a 3-layer MLP with BatchNorm and ReLU activations and a mean-squared loss for 100 epochs, yielding an in-distribution test mean squared error (MSE) of ∼0.15. In Table 2, we find that online updating reduces the OOD test MSE significantly. Results are aggregated over five trials, with a different random sampling of all data, followed by training and validation, each time. Full results and more experimental details are in Appendix C.2.

5.2.2. WILDS-POVERTYMAP

We use the WILDS variant of a poverty-mapping dataset (Yeh et al., 2020). This is a dataset for estimating average household economic conditions in a region from satellite imagery, measured by an asset wealth index computed from survey data. The data comprise 8-channel satellite images from 23 African countries; the locations here correspond to different countries. Due to the smaller size of the dataset, Koh et al. (2021) recommend a five-fold evaluation, where every fold is approximately constructed as follows: 10K images from 13-14 countries in the training set; 1K images from the same countries for in-distribution validation; 1K images from these countries for in-distribution testing; 4K images from 4-5 countries not in the training set for OOD validation; and 4K images from 4-5 countries in neither the training nor validation sets for OOD testing. The evaluation metric is Pearson's correlation between the predicted and actual economic index, as is standard in the literature (Yeh et al., 2020). Following Koh et al. (2021), we split the assessment into overall average as well as worst-group performance, which takes the worst performance across rural/urban subgroups. As with IWILDCAM, we use the CORAL-augmented base networks and weights released by Koh et al. (2021), but with our retrained versions for the average correlation coefficient (since the validation choices for the released weights targeted worst-group performance). We evaluate separately for each fold (which vary quite a bit in base performance) with 5 random orderings of each of the test sets. In Table 2, we find that while there is generally little to no improvement in average correlation, there are more significant improvements for three of five folds in terms of worst-group performance. As noted in Koh et al. (2021), a wide range of differences along many dimensions, such as infrastructure, agriculture, development, and cultural aspects, play a role not only in determining the wealth distribution, but also in how the features manifest in different places. Such real-world issues imply that validating for OOD performance is bound to be sensitive to the problem type, the specific choice of validation sets used to tune hyper-parameters, and the differences that may arise between an OOD validation set and an OOD test set. This issue extends generally to all attempts at OOD generalization.

5.3. TAKEAWAYS

Our experiments are generally suggestive of the following takeaways.
• While invertible confusion matrices are not always achievable due to data scarcity (as modelled in our experiments with WILDS-IWILDCAM), a practitioner can adopt confusion-matrix-free methods such as FTH-H-B, which we find to provide competitive or improved performance. Using OOD validation sets to estimate confusion matrices can improve results relative to using an IID validation set, although confusion matrices estimated on smaller sets can be noisy.
• Learning additional scaling hyper-parameters can be useful for further improvements. We find this trend does not hold for SKEWED-COCO-ON-PLACES (FTH outperforms FTH-H and FTH-H-B). We suspect this is likely due to instability from the relatively smaller size of the validation set: when picking oracle scaling hyper-parameters on the test set, we achieve an accuracy of 59.37 ± 0.89. In Appendix D we compare performance when learning hyper-parameters on different validation sets: IID/OOD/test (oracle).

6. RELATED WORK

Label-shift for classifiers. Saerens et al. (2002) provide a seminal discussion of adapting the output distribution of a classifier when the test set undergoes label-shift. This approach presumes up-front access to the entire test set, or a sufficiently representative sample. More recent works have investigated other ways to estimate label-shift using confusion matrices (Lipton et al., 2018; Azizzadenesheli et al., 2019), which partially inspired the methods in Wu et al. (2021) that we use as our foundation. It has recently been suggested (Alexandari et al., 2020; Garg et al., 2020) that the simple correction method of Saerens et al. (2002) often outperforms these later methods when combined with calibration. While Alexandari et al. (2020) perform their calibration using a held-out IID validation set for their iterative method, we adapt this strategy to the out-of-distribution setting by picking scaling hyper-parameters on an OOD validation set.

Test-time training

Another emerging line of literature focuses on updating neural network parameters using test data, without being able to match training statistics with test statistics due to the potential lack of access to training data for the same topical reasons: data privacy and large datasets. Examples include updating the Batch-Norm statistics by minimizing test-time entropy (Wang et al., 2021), or using self-supervised pseudo-labels to adapt the feature-extraction part of the network (Liang et al., 2020). Our setup here can be viewed as a form of test-time training, but in a more constrained setting, with inaccessible model parameters and no resources to replicate an on-site model by querying the black-box model, e.g. using distillation (Hinton et al., 2015).

7. CONCLUSION

In this paper, we empirically investigated the effectiveness of online black-box adaptation methods for label-shift when the key underlying assumption of invariant class-conditional input distributions is broken. We found that while existing methods can be effective to an extent regardless of conditional-shift, performance can be improved by adopting intuitive heuristics: in particular, estimating confusion matrices on OOD validation sets, and learning additional scaling hyper-parameters in the output-adjustment step to account for shifting distributions.

A POSTERIOR UPDATE

We derive the posterior update equation (Eq. 13), specifying the conditions under which this rule holds. The key assumption is that in the new deployment location, categories are encountered in an IID manner, i.e., $y_j \perp\!\!\!\perp y_k$. The required distributions are

$$P_t(\phi) = P(\phi \mid y_1, \cdots, y_t) \quad (21)$$
$$= \frac{P(y_1, \cdots, y_t \mid \phi)\, P(\phi)}{P(y_1, \cdots, y_t)} \quad \text{(Bayes rule)} \quad (22)$$
$$\propto P(y_1, \cdots, y_t \mid \phi)\, P(\phi) \quad \text{(dropping terms independent of } \phi\text{)} \quad (23)$$
$$= \prod_{i=1}^{t} P(y_i \mid \phi)\, P(\phi) \quad \text{(using the assumption } y_j \perp\!\!\!\perp y_k\text{)} \quad (24)$$
$$= P(y_t \mid \phi) \prod_{i=1}^{t-1} P(y_i \mid \phi)\, P(\phi) \quad \text{(regrouping terms)} \quad (25)$$
$$= P(y_t \mid \phi)\, P_{t-1}(\phi). \quad (26)$$

B.1 CUBIC EQUATION

Recall the three distributions involved in the predictive rule,

$$P(y \mid x) \propto \exp\left( -\frac{\lambda_x}{2}(y - f(x))^2 \right), \quad P_{new}(y) \propto \left( 1 + \frac{L}{2a}(y - \mu)^2 \right)^{-\frac{2a+1}{2}}, \quad P(y) \propto \exp\left( -\frac{\lambda_y}{2}(y - m)^2 \right),$$

which give us the objective $J$ expressed as

$$J = -\log P(y \mid x) - \log P_{new}(y) + \log P(y) \quad (31)$$
$$= \frac{\lambda_x}{2}(y - f(x))^2 - \frac{\lambda_y}{2}(y - m)^2 + \frac{2a+1}{2} \log\left( 1 + \frac{L}{2a}(y - \mu)^2 \right). \quad (32)$$

The derivative of this objective wrt $y$ is

$$\frac{\partial J}{\partial y} = \lambda_x (y - f(x)) - \lambda_y (y - m) + (2a+1) \frac{\frac{L}{2a}(y - \mu)}{1 + \frac{L}{2a}(y - \mu)^2}. \quad (33)$$

Writing $\tau_d = \lambda_x - \lambda_y$, $\tau_\mu = \lambda_y m - \lambda_x f(x)$, $A = 2a+1$ and $M = \frac{L}{2a}$, this becomes

$$= \tau_d y + \tau_\mu + \frac{A M (y - \mu)}{1 + M (y - \mu)^2}. \quad (36)$$

Setting this to zero, we have

$$(\tau_d y + \tau_\mu)\left( 1 + M(y - \mu)^2 \right) + A M (y - \mu) = 0 \quad (37)$$
$$\Longrightarrow (\tau_d y + \tau_\mu)\left( 1 + M y^2 + M \mu^2 - 2 M \mu y \right) + A M (y - \mu) = 0 \quad (38)$$
$$\Longrightarrow M \tau_d y^3 + (M \tau_\mu - 2 M \mu \tau_d) y^2 + (\tau_d + M \mu^2 \tau_d - 2 M \mu \tau_\mu + A M) y + (\tau_\mu + M \tau_\mu \mu^2 - A M \mu) = 0, \quad (40)$$

which is the equation we solve for $y$. We use NUMPY's polynomial solver to find roots. A cubic equation has either one real root and a pair of complex-conjugate roots, or all real roots. We test the real solutions for positive curvature (implying a local minimum), and pick the minimum with the smallest value of the objective $J$.

B.2 SECOND DERIVATIVE TEST FOR SOLUTIONS

The second derivative of $J$ is given by

$$\tau_d - \frac{2 A M^2 (y - \mu)^2}{\left( 1 + M (y - \mu)^2 \right)^2} + \frac{A M}{1 + M (y - \mu)^2}.$$

Writing $y - \mu$ as $D$, we have

$$\tau_d + \frac{A M}{1 + M D^2} \left( 1 - \frac{2 M D^2}{1 + M D^2} \right) = \tau_d + \frac{A M (1 - M D^2)}{(1 + M D^2)^2}.$$

When this expression is positive, we have a local minimum. For the first term to be positive, we require $\tau_d > 0$, which has a straightforward intuitive interpretation: $\lambda_x > \lambda_y$, i.e. the output precision should be higher than the marginal-adjustment precision. This is a reasonable condition which we expect to be fulfilled, since we typically expect to rely more strongly on the underlying predictive model than on the marginal alone. In the second term, $A M$ is always non-negative for a positive pseudo-count, and the denominator is always positive. Substituting in the expressions for the values after the $t$-th update, we have

$$M D^2 = \frac{\frac{\kappa_t}{\kappa_t + 1} (y - \mu_t)^2}{\sum_{\tau=0}^{t-1} \frac{\kappa_\tau}{\kappa_\tau + 1} (\hat{y}_{\tau+1} - \mu_\tau)^2}.$$

When this term is $\leq 1$, positivity is guaranteed (strictly speaking, $\tau_d$ gives the second term some room for negative values, but we ignore this for simplified reasoning). This condition implies

$$(y - \mu_t)^2 \leq \frac{\kappa_t + 1}{\kappa_t} \sum_{\tau=0}^{t-1} \frac{\kappa_\tau}{\kappa_\tau + 1} (\hat{y}_{\tau+1} - \mu_\tau)^2,$$

which in turn implies that the following range of $y$ allows local minima:

$$\mu_t - \sqrt{\frac{\kappa_t + 1}{\kappa_t} \sum_{\tau=0}^{t-1} \frac{\kappa_\tau}{\kappa_\tau + 1} (\hat{y}_{\tau+1} - \mu_\tau)^2} \;\leq\; y \;\leq\; \mu_t + \sqrt{\frac{\kappa_t + 1}{\kappa_t} \sum_{\tau=0}^{t-1} \frac{\kappa_\tau}{\kappa_\tau + 1} (\hat{y}_{\tau+1} - \mu_\tau)^2}.$$

An intuitive interpretation of this condition is that valid updates are allowed within a range that grows with the total observed variance up to the $t$-th test example. In practice, we find that validation tends to pick values with $\lambda_x > \lambda_y$, and that the no-local-minimum case typically does not arise for the optimal hyper-parameters in our experiments.

B.3 INITIALIZING PRIORS

For initializing priors, we endeavour to stay unbiased, since we assume that deployment locations can have significantly different target distributions than we might anticipate from the marginal over the training set. For classification, we built this in by using a uniform pseudo-count for all classes and sources. For regression, we simulate a pseudo-count of uniform samples from the output range.

C EXPERIMENTAL DETAILS

OOD validation points: sampled from $\mathcal{N}(-3.5, 0.2)$ and $\mathcal{N}(1, 0.2)$, with 250 points each. OOD test points: sampled from $\mathcal{N}(0, 0.2)$ and $\mathcal{N}(3, 0.2)$, with 250 points each. For the OOD sets, the different sampling distributions correspond to different locations. For different trials, we repeat the whole experiment from scratch, sampling new training, validation, and test sets, and performing validation every time. The network architecture is a 3-layer MLP with 128 hidden units, with BATCHNORM and RELU after the hidden activations. A weight decay of 1e-8 is applied to all parameters. We train for 100 epochs with batch sizes of 100, using SGD + Momentum (0.9), starting with an initial learning rate of 0.01 and scaling it by 0.95 after every epoch.

Note that the pattern of label-shift is the same across the validation and test subsets (albeit of a smaller size). This proof-of-concept experiment is intended as a middle ground between the COLORED-MNIST and WILDS-IWILDCAM experiments, in that the potential of learning hyper-parameters to account for conditional-shift is tested while keeping the label-shift pattern fixed. We train for 400 epochs with SGD + Momentum (0.9), using batch sizes of 128, with an initial learning rate of 0.1 which is divided by 5 at the 240th, 320th, and 360th epochs. An L2 weight-decay regularizer is applied to all parameters with a coefficient of 5e-4. We normalize images with the training-set per-channel mean and standard deviation, and apply data augmentation of random crops to 224 × 224 and random horizontal reflections.

D HYPER-PARAMETER SELECTION

We contrast performance when methods use IID validation sets vs. OOD validation sets vs. the test set itself, in Table 6 . We observe that, generally speaking, OOD validation can improve over IID validation.

E IDENTITY APPROXIMATION FOR CONFUSION MATRIX

Degenerate confusion matrices can arise when there are missing categories in the validation set used to compute them (leading to zero rows), or when two or more rows are exactly the same (for example, when multiple rare categories all get categorized the same way). Two options are to use a soft-confusion matrix, or a pseudo-inverse (Lipton et al., 2018). Since the IWILDCAM dataset is significantly long-tailed, with a large number of classes not represented in the validation sets, we end up with a number of zero rows in the soft-confusion matrix. For such rows, we simply placed a 1 in the diagonal element. In Table 7, we find these alternatives to result in degraded performance for IWILDCAM, generally much worse than our identity approximation. We hypothesize that this is partly because both our zero-row heuristic for the soft-confusion matrix, and the analogous effect of the pseudo-inverse, encode a misleading assumption: rare classes, absent from the validation sets, are in fact more likely to be confused than the frequent ones. This is one possibility for why the less presumptive identity approximation performs better. The inherent difficulty in estimating robust confusion matrices has been recognized in the literature, with the typical approach being to hold out significantly large validation sets in order to estimate less noisy confusion matrices. In Table 8, we include numbers from an identity approximation on the synthetic datasets, where the confusion matrices were invertible. On the whole, we suggest that in difficult, real-life situations, simpler approximations may continue to serve practitioners well, while more sophisticated methods can impose specific requirements to be successful.

F HYPERPARAMETERS, COMPUTE, AND CODE AND DATA LICENSES
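The zero-row heuristic described above can be sketched as follows (an illustrative sketch; `patch_confusion_matrix` is our own name, not from the paper's code):

```python
import numpy as np

def patch_confusion_matrix(C):
    """Place a 1 on the diagonal of any all-zero row (a class absent from
    the validation set), so the matrix remains usable downstream."""
    C = C.copy()
    zero_rows = ~C.any(axis=1)
    C[zero_rows, zero_rows] = 1.0
    return C

C = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.0, 0.0],   # class 1 never appears in validation
              [0.2, 0.0, 0.8]])
C_patched = patch_confusion_matrix(C)
```

Note that this patch implicitly asserts the missing class is never confused with any other, which, as argued above, is likely the opposite of the truth for rare classes.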
The hyper-parameters involved are the two calibration terms λ_u, λ_y and the pseudo-count term α_0 for classification, and λ_x, λ_y, κ for the regression problems. These were picked via grid-search on the OOD validation sets, optimizing for OOD performance in all cases. For OGD methods, the learning rate is an additional hyper-parameter, selected in the same way. V100 GPUs were used to train base models (in cases where we trained our own models), and the online adjustment experiments were performed on an Apple MacBook Air with saved outputs from the models.

Table 7: We compare the use of a soft-confusion matrix and the pseudo-inverse with our identity-matrix approximation for IWILDCAM. We find that FTH performance drops sharply, and that for OGD, the optimal learning rate is most often zero, leading to no difference from base performance. For OGD, we find the optimal learning rate on the test set for all choices of confusion matrix, reporting best-case performance.



(a) Synthetic variant of the MNIST dataset constructed by using colors to correspond to sources with skewed label-distributions. The colors are flipped for validation and test with different correlation strengths, corresponding to (almost completely) reversing the label-skew at the sources at test-time. (b) Synthetic MIX-OF-GAUSSIANS data. Differently colored regions along the x-axis correspond to training, validation and test samples, with different regions of the same color corresponding to different sources/locations.

Figure 1: Synthetic MNIST and Gaussian datasets.

Figure 2: Skewed COCO-on-Places: Synthetic dataset constructed by superimposing COCO objects (Lin et al., 2014) on scenes from the Places dataset (Zhou et al., 2017). The 5 columns correspond to 5 sources of data, where the backgrounds correspond to examples of particular scenes, and the skew in the number of examples per row corresponds to the skew in label distribution we impose. Different background scenes are used for training, validation, and test sets.

We construct a second, more photo-realistic, synthetic dataset by superimposing segmented objects from COCO (Lin et al., 2014) on to scenes from the PLACES dataset (Zhou et al., 2017), as in Ahmed et al. (2021). The scenes correspond to the notion of a deployment location, albeit with significant intra-location variation. For every such scene-represented source, we use a different class-distribution to simulate source-specific skews in the label distribution. In Fig. 2, the relative number of images per row represents the relative frequency of a particular class at a specific source. There are a total of ∼10K training images, ∼2.5K validation images (each for seen and unseen sources), and ∼6K test images (each for seen and unseen sources).

Classification problems: Average accuracy on SKEWED-MNIST, SKEWED-COCO-ON-PLACES, and WILDS-IWILDCAM (also reporting macro F1-score for IWILDCAM). Overall trends indicate that our heuristics are helpful, and FTH-H-B is competitive or better without needing a confusion matrix.

The WILDS-IWILDCAM data consists of burst images taken at camera traps, triggered by animal motion. The task is to identify the species in the picture, and the locations correspond to the unique camera trap the pictures are from. There are a total of 182 species in this version of the dataset across a total of 323 camera traps. There is significant skew in terms of the species distribution across different camera traps, as well as in the number of images available for each trap.
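One simple way to simulate such source-specific label skews (a sketch under our own assumptions, not the paper's actual generation script) is to give each source a decaying weight profile over its own random permutation of the classes:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_sources = 10, 5

def skewed_dist(rng, num_classes, decay=0.6):
    # Geometric-style decay over a random permutation of the classes,
    # so each source favours a different subset of classes.
    weights = decay ** np.arange(num_classes)
    dist = np.empty(num_classes)
    dist[rng.permutation(num_classes)] = weights
    return dist / dist.sum()

dists = [skewed_dist(rng, num_classes) for _ in range(num_sources)]
labels = rng.choice(num_classes, size=200, p=dists[0])  # sample one source
```

Each source then contributes examples per class in proportion to its own distribution, which is what the per-row image counts in Fig. 2 depict.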
The training set consists of ∼ 130K images from 243 traps; the in-distribution validation set consists of ∼ 7.3K images from the same traps as that in the training set but on different dates; the OOD validation set consists of ∼ 15K images taken at 32 traps that are different from the ones in the training set; the in-distribution test set consists of ∼ 8.1K images taken by the same camera traps as in the training set, but on different dates from both training and validation; finally, the OOD test set consists of ∼ 43K images taken at 48 camera traps that are different from those for all other splits.

Regression problems: For the GAUSSIANS dataset the metric is mean squared error (lower is better), and for the PovertyMap folds the metric is Pearson's correlation coefficient (higher is better), computed separately for average (ALL) and worst-group (WG) performance.

There has been a recent surge of interest in methods aiming to learn stable or invariant features across different domains/environments/groups (Sun & Saenko, 2016; Arjovsky et al., 2019; Krueger et al., 2020; Sagawa et al., 2020). Such approaches have been demonstrated to be useful for certain types of distributional shifts, such as improved minority-group robustness (Sagawa et al., 2020) and systematic generalization (Ahmed et al., 2021). Our discussion in this paper is complementary to this set of methods in OOD generalization research. One can use an underlying model trained with cross-group penalties that result in improved OOD generalization, and further improve performance by factoring in useful contextual information.

We include the non-aggregated MSEs below to confirm that there are consistent improvements over every base model/data-sampling individually. We chose the following objects for this synthetic classification task: bicycle, train, cat, chair, horse, motorcycle, bus, dog, couch, and zebra; and the following scenes to simulate different sources. Across the 5 sources, the numbers of examples in the training, validation, and test sets are as follows.


(top) Classification problems: Performance when picking hyper-parameters on IID validation sets, OOD validation sets, or on (Oracle) test sets. (bottom) Regression problems: Performance when picking hyper-parameters on IID validation sets, OOD validation sets, or on (Oracle) test sets. For MIX-OF-GAUSSIANS, we use mean squared error as the metric (lower is better), while for POVERTYMAP the metric is Pearson's correlation coefficient (higher is better).

For OGD, the additional hyper-parameter is the learning rate used for updating p; this learning rate is searched over a range from 1e-8 to 10 in steps of ×10.
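The learning-rate grid described above (1e-8 to 10, multiplying by 10 at each step) is simply:

```python
# Learning rates from 1e-8 to 10, one decade apart.
lr_grid = [10.0 ** k for k in range(-8, 2)]
```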

Identity approximation with S-MNIST and S-COCO-ON-PLACES, with test-time performance using the original confusion matrix C f for reference. When using the identity approximation, OGD (IID) uses the IID validation set to estimate C g and OGD (OOD) uses the OOD validation set.

ACKNOWLEDGEMENTS

We reused code from https://github.com/p-lambda/wilds, released under the MIT License, and code from https://github.com/wrh14/online_adaption_to_label_distribution_shift, publicly released by Wu et al. (2021). We also used data from MS-COCO, released under the CREATIVE COMMONS ATTRIBUTION 4.0 LICENSE. WILDS-IWILDCAM is under COMMUNITY DATA LICENSE AGREEMENT - PERMISSIVE - V1.0, and the WILDS-POVERTYMAP data is U.S. PUBLIC DOMAIN (LANDSAT/DMSP/VIIRS).

APPENDIX

If we start with a reference prior for the Normal-Gamma distribution with parameter settings µ_0 arbitrary, κ_0 = 0, α_0 = -1/2, β_0 = 0, then after observing N data-points {y_1, ..., y_N}, y_i ∼ U[L, H] (the uniformly sampled points we will simulate), the resulting posterior has parameters µ_N = ȳ, κ_N = N, α_N = (N - 1)/2, β_N = (1/2) Σ_i (y_i - ȳ)². In this view, κ corresponds to the pseudo-count (as per the interpretation of the parameters of the Normal-Gamma conjugate prior in Murphy (2007)), and α is defined in terms of κ. To improve stability, we set µ to the middle of the output range, (L + H)/2, rather than actually estimate the mean of our uniform pseudo-samples. Likewise, we set β by estimating its value as a function of κ, using the expression for the variance of a uniform distribution, (H - L)²/12.

C EXPERIMENTAL DETAILS
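These closed-form choices can be collected into a small helper (a sketch; the function and argument names are ours, and the posterior parameters follow the Normal-Gamma updates from a reference prior as in Murphy (2007)):

```python
def uniform_pseudo_prior(L, H, kappa):
    """Normal-Gamma parameters simulating `kappa` pseudo-observations drawn
    uniformly from [L, H]: mu is the midpoint of the range, alpha follows
    from the reference prior, and beta uses the uniform variance (H-L)^2/12."""
    mu = (L + H) / 2.0                       # midpoint instead of sample mean
    alpha = -0.5 + kappa / 2.0               # alpha_N = alpha_0 + N/2
    beta = 0.5 * kappa * (H - L) ** 2 / 12   # (N/2) * Var(U[L, H])
    return mu, kappa, alpha, beta

mu, kappa, alpha, beta = uniform_pseudo_prior(0.0, 1.0, 12)
```

Here β replaces the sum-of-squares term (1/2) Σ_i (y_i - ȳ)² with its expectation under uniform sampling, matching the stability choices described above.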

C.1 SYNTHETIC MNIST

The splitting of digits into two sets is performed by observing mis-classification matrices after 200 iterations of training a neural network, averaged across 100 runs: digits are put into opposing sets if they tend to be confused, while also trying to keep the set sizes balanced.

The network architecture consists of 3 CONV layers with 64, 128 and 256 channels, each followed by MAXPOOL, BATCHNORM, and RELU. After the third layer, we spatially mean-pool activations and use a linear layer to map to the logits. A weight decay of 5e-4 is applied on all parameters. Training is conducted for 20 epochs with batches of size 256, where training accuracy saturates to 100%. An initial learning rate of 0.1 is used, which is cut by 5 at the 6th, 12th and 16th epochs.

The datapoint-counts in the train/val/test environments are as follows.
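The step learning-rate schedule just described (initial rate 0.1, cut by 5 at epochs 6, 12, and 16) can be sketched as a simple helper (illustrative only, not the authors' training code):

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(6, 12, 16), factor=5.0):
    """Step schedule: divide the learning rate by `factor` at each
    milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= factor
    return lr
```

In a framework like PyTorch this corresponds to a MultiStep-style scheduler with milestones (6, 12, 16) and a decay factor of 1/5.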

C.2 SYNTHETIC GAUSSIAN

The synthetic data for this experiment is generated with the following function 

