COPING WITH LABEL SHIFT VIA DISTRIBUTIONALLY ROBUST OPTIMISATION

Abstract

The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.

1. INTRODUCTION

Classical supervised learning involves learning a model from a training distribution that generalises well on test samples drawn from the same distribution. While the assumption of identical train and test distributions has given rise to useful methods, it is often violated in many practical settings (Kouw & Loog, 2018) . The label shift problem is one such important setting, wherein the training distribution over the labels does not reflect what is observed during testing (Saerens et al., 2002) . For example, consider the problem of object detection in self-driving cars: a model trained in one city may see a vastly different distribution of pedestrians and cars when deployed in a different city. Such shifts in label distribution can significantly degrade model performance. As a concrete example, consider the performance of a ResNet-50 model on ImageNet. While the overall error rate is ∼ 24%, Figure 1 reveals that certain classes suffer an error as high as ∼ 80%. Consequently, a label shift that increases the prevalence of the more erroneous classes in the test set can significantly degrade performance. Most existing work on label shift operates in the setting where one has an unlabelled test sample that can be used to estimate the shifted label probabilities (du Plessis & Sugiyama, 2014; Lipton et al., 2018; Azizzadenesheli et al., 2019) . Subsequently, one can retrain a classifier using these probabilities in place of the training label probabilities. While such techniques have proven effective, it is not always feasible to access an unlabelled set. Further, one may wish to deploy a learned model in multiple test environments, each one of which has its own label distribution. For example, the label distribution for a vehicle detection camera may change continuously while driving across the city. Instead of simply deploying a separate model for each scenario, deploying a single model that is robust to shifts may be more efficient and practical. 
Hence, we address the following question in this work: can we learn a single classifier that is robust to a family of arbitrary shifts? We answer the above question by modeling label shift via distributionally robust optimisation (DRO) (Shapiro et al., 2014; Rahimian & Mehrotra, 2019) . DRO offers a convenient way of coping with distribution shift, and has led to successful applications (e.g. Faury et al. (2020) ). Intuitively, we seek a model that performs well on all label distributions that are "close" to the training label distribution; this task can be cast as a game between the learner and an adversary, with the latter allowed to pick label distributions that maximise the learner's loss. We remark that while adversarial perspectives have informed popular paradigms such as GANs, these pursue fundamentally different objectives from DRO (see Appendix A for details). Although several previous works have explored DRO for tackling the problem of example shift (e.g., adversarial examples) (Namkoong & Duchi, 2016; 2017; Duchi & Namkoong, 2018) , an application of DRO to the label shift setting poses several challenges: (a) updating the adversary's distribution naïvely requires solving a nontrivial convex optimisation subproblem with limited tractability, and also needs careful parameter tuning; and (b) naïvely estimating gradients under the adversarial distribution on a randomly sampled minibatch can lead to unstable behaviour (see §3.1). We overcome these challenges by proposing the first algorithm that successfully optimises a DRO objective for label shift on a large-scale dataset (i.e., ImageNet). Our objective encourages robustness to arbitrary label distribution shifts within a KL-divergence ball of the empirical label distribution. Importantly, we show that this choice of robustness set admits an efficient and stable update step.

Summary of contributions

(1) We design a gradient descent-proximal mirror ascent algorithm tailored for optimising large-scale problems with minimal computational overhead, and prove its theoretical convergence. (2) With the proposed algorithm, we implement a practical procedure to successfully optimise the robust objective on ImageNet scale for the label shift application. (3) We show through experiments on ImageNet and CIFAR-100 that our technique significantly improves over baselines when the label distribution is adversarially varied.

2. BACKGROUND AND PROBLEM FORMULATION

In this section we formalise the label shift problem and motivate its formulation as an adversarial optimisation problem. Consider a multiclass classification problem with distribution $p_{tr}$ over instances $\mathcal{X}$ and labels $\mathcal{Y} = [L]$. The goal is to learn a classifier $h_\theta : \mathcal{X} \to \mathcal{Y}$ parameterised by $\theta \in \Theta$, with the aim of ensuring good predictive performance on future samples drawn from $p_{tr}$. More formally, the goal is to minimise the objective $\min_\theta \mathbb{E}_{(x,y) \sim p_{tr}}[\ell(x, y, \theta)]$, where $\ell : \mathcal{X} \times \mathcal{Y} \times \Theta \to \mathbb{R}_+$ is a loss function. In practice, we only have access to a finite sample $S = \{(x_i, y_i)\}_{i=1}^n \sim p_{tr}^n$, which motivates us to use the empirical distribution $p_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x = x_i, y = y_i)$ in place of $p_{tr}$. Doing so, we arrive at the objective of minimising the empirical risk:

$$\min_\theta \mathbb{E}_{p_{emp}}[\ell(x, y, \theta)] := \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i, \theta). \quad (1)$$

The assumption underlying the above formulation is that test samples are drawn from the same distribution $p_{tr}$ that is used during training. However, this assumption is violated in many practical settings. The problem of learning from a training distribution $p_{tr}$, while attempting to perform well on a test distribution $p_{te} \neq p_{tr}$, is referred to as domain adaptation (Ben-David et al., 2007). In

Label distribution                            Reference
------------------------------------------------------------------
Train distribution                            Standard ERM
Specified a-priori (e.g., balanced)           (Elkan, 2001; Xie & Manski, 1989; Cao et al., 2019)
Estimated test distribution                   (du Plessis & Sugiyama, 2014; Lipton et al., 2018; Azizzadenesheli et al., 2019; Garg et al., 2020; Combes et al., 2020)
Worst-performing class                        (Hashimoto et al., 2018; Mohri et al., 2019; Sagawa et al., 2020)
Worst k-performing classes                    (Fan et al., 2017; Williamson & Menon, 2019; Curi et al., 2019; Duchi et al., 2020)
Adversarial shifts within KL-divergence ball  This paper

Table 1: Summary of approaches to learning with a modified label distribution.

the special case of label shift, one posits that $p_{te}(x \mid y) = p_{tr}(x \mid y)$, but the label distribution $p_{te}(y) \neq p_{tr}(y)$ (Saerens et al., 2002); i.e., the test distribution satisfies $p_{te}(x, y) = p_{te}(y) \, p_{tr}(x \mid y)$. The label shift problem admits the following three distinct settings (see Table 1 for a summary):

(1) Fixed label shift. Here, one assumes a-priori knowledge of $p_{te}(y)$. One may then adjust the outputs of a probabilistic classifier post-hoc to improve test performance (Elkan, 2001). Even when the precise distribution is unknown, it is common to posit a uniform $p_{te}(y)$. Minimising the resulting balanced error has been the subject of a large body of work (He & Garcia, 2009), with recent developments including Cui et al. (2019); Cao et al. (2019); Kang et al. (2020); Guo et al. (2020).

(2) Estimated label shift. Here, we assume that $p_{te}(y)$ is unknown, but that we have access to an unlabelled test sample. This sample may be used to estimate $p_{te}(y)$, e.g., via kernel mean-matching (Zhang et al., 2013), minimisation of a suitable KL divergence (du Plessis & Sugiyama, 2014), or using black-box classifier outputs (Lipton et al., 2018; Azizzadenesheli et al., 2019; Garg et al., 2020). One may then use these estimates to minimise a suitably re-weighted empirical risk.

(3) Adversarial label shift.
Here, we assume that $p_{te}(y)$ is unknown, and guard against a suitably defined worst-case choice. Observe that an extreme case of label shift involves placing all probability mass on a single $y^* \in \mathcal{Y}$. This choice can be problematic, as (1) may be rewritten as

$$\min_\theta \sum_{y \in [L]} p_{emp}(y) \cdot \frac{1}{n_y} \sum_{i : y_i = y} \ell(x_i, y_i, \theta),$$

where $n_y$ is the number of training samples with label $y$. The empirical risk is thus a weighted average of the per-class losses. Observe that if some $y^* \in \mathcal{Y}$ has a large per-class loss, then an adversary could degrade performance by choosing a $p_{te}$ with $p_{te}(y^*)$ large. One means of guarding against such adversarial label shifts is to minimise the minimax risk (Alaiz-Rodríguez et al., 2007; Davenport et al., 2010; Hashimoto et al., 2018; Mohri et al., 2019; Sagawa et al., 2020)

$$\min_\theta \max_{\pi \in \Delta_L} \sum_{y \in [L]} \pi(y) \cdot \frac{1}{n_y} \sum_{i : y_i = y} \ell(x_i, y_i, \theta), \quad (2)$$

where $\Delta_L$ denotes the simplex. In (2), we combine the per-label risks according to the worst-case label distribution. In practice, focusing on the worst-case label distribution may be overly pessimistic. One may temper this by instead constraining the label distribution. A popular choice is to enforce that $\|\pi\|_\infty \leq \frac{1}{k}$ for a suitable integer $k$, which corresponds to minimising the average of the $k$ largest per-class losses (Williamson & Menon, 2019; Curi et al., 2019; Duchi et al., 2020).

We focus on the adversarial label shift setting, as it meets the desiderata of training a single model that is robust to multiple label distributions, without requiring access to test samples. Adversarial robustness has been widely studied (see Appendix A for more related work), but its application to label shift is much less explored. Amongst techniques in this area, Mohri et al. (2019); Sagawa et al. (2020) are most closely related to our work. These works optimise the worst-case loss over subgroups induced by the labels.
However, both works consider settings with a relatively small (≤ 10) number of subgroups; the resultant algorithms face many challenges when trained with many labels (see Section 4). We now detail how a suitably constrained DRO formulation, coupled with careful optimisation choices, can overcome this limitation.

3. ADVSHIFT: DISTRIBUTIONALLY ROBUST LABEL SHIFT

To guard against adversarial label shift, we propose to minimise the worst-case risk over a constrained set of label distributions:

$$\min_\theta \max_{\pi \in \mathcal{P}} \mathbb{E}_\pi[\ell(x, y, \theta)], \qquad \mathcal{P} = \{\pi \in \Delta_L \mid d(\pi, p_{emp}) \leq r\}, \quad (3)$$

where $\mathcal{P}$ is an uncertainty set containing perturbations of the empirical distribution $p_{emp}$. This is an instance of distributionally robust optimisation (DRO) (Shapiro et al., 2014), a framework where one minimises the worst-case expected loss over a family of distributions. In this work, we instantiate DRO with $\mathcal{P}$ being a parameterised family of distributions whose marginal label distribution lies within a KL-divergence ball of $p_{emp}$, i.e., $d(p, q) = \mathrm{KL}(p, q) = \mathbb{E}_{y \sim p}[\log(p(y)/q(y))]$. (We use this divergence, as opposed to a generic f-divergence, as it affords closed-form updates; see §3.3.) Solving (3) thus directly addresses adversarial label shift, as it ensures our model performs well on arbitrary label distributions from $\mathcal{P}$. Observe further that the existing minimax risk (2) is a special case of (3) with $r = +\infty$.

Having stated our learning objective, we now turn to the issue of how to optimise it. One natural thought is to leverage strategies pursued in the literature on example-level DRO using f-divergences. For example, Namkoong & Duchi (2016) propose an algorithm that alternately performs iterative gradient updates for the model parameters $\theta$ and the adversarial distribution $\pi$, assuming access to projection oracles and the ability to sample from the adversarial distribution. However, there are challenges in applying such techniques to large-scale problems (e.g., ImageNet): (1) directly sampling from $\pi$ is challenging in most data loading pipelines for ImageNet; (2) projecting $\pi$ onto the feasible set $\mathcal{P}$ requires solving a constrained convex optimisation problem at every iteration, which can incur non-trivial overhead (see Appendix E).
We now describe ADVSHIFT (Algorithm 1), our approach to solving these problems. In a nutshell, we iteratively update the model parameters θ and the adversarial distribution π. For the former, we update exactly as in ERM optimisation (e.g., ADAM, SGD), via a routine we denote NNOpt (neural network optimiser); for the latter, we introduce a Lagrange multiplier to avoid projection. Extra care is needed to obtain unbiased gradients and speed up adversarial convergence, as we now detail.

3.1. ESTIMATING THE ADVERSARIAL MINIBATCH GRADIENT

For a fixed $\pi \in \Delta_L$, to estimate the parameter gradient $\mathbb{E}_\pi[\nabla_\theta \ell(x, y, \theta)]$ on a training sample $S = \{(x_i, y_i)\}_{i=1}^n$, we employ the importance weighting identity and write

$$\mathbb{E}_\pi[\nabla_\theta \ell(x, y, \theta)] = \mathbb{E}_{p_{emp}}\!\left[\frac{\pi(y)}{p_{emp}(y)} \cdot \nabla_\theta \ell(x, y, \theta)\right] = \frac{1}{n} \sum_i \frac{\pi(y_i)}{p_{emp}(y_i)} \cdot \nabla_\theta \ell(x_i, y_i, \theta).$$

We may thus draw a minibatch as usual from $S$, and apply suitable weighting to obtain unbiased gradient estimates. A similar reweighting is necessary to compute the adversary's gradient $\nabla_\pi \mathbb{E}_\pi[\ell(x, y, \theta)]$. Making the adversarial update efficient requires further effort, as we now discuss.
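The reweighting identity above can be sketched on a minibatch as follows; this is a minimal NumPy illustration (the function name and loss inputs are ours, not from the paper's code):

```python
import numpy as np

def reweighted_batch_loss(losses, labels, pi, p_emp):
    """Unbiased estimate of E_pi[loss] from a minibatch drawn from p_emp.

    losses: per-example losses on the minibatch, shape (B,)
    labels: integer labels in [0, L), shape (B,)
    pi, p_emp: adversarial and empirical label distributions, shape (L,)
    """
    # Importance weights pi(y_i) / p_emp(y_i) for each example.
    weights = pi[labels] / p_emp[labels]
    return float((weights * losses).mean())
```

In an autodiff framework, backpropagating through this scalar yields the unbiased parameter gradient; when `pi == p_emp` the estimate reduces to the plain minibatch average.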

3.2. REMOVING CONSTRAINTS BY LAGRANGIAN DUALITY

To efficiently update the adversarial distribution π in (3), we would like to avoid the cost of projecting onto $\mathcal{P}$. To bypass this difficulty, we make the following observation based on Lagrangian duality.

Proposition 1. Suppose $\ell$ is bounded, and $p_{emp}$ is not on the boundary of the simplex. Then, for all $r > 0$, there exists $\gamma^* > 0$ such that for every $\gamma_c \geq \gamma^*$, the constrained objective is solvable in unconstrained form:

$$\operatorname*{argmax}_{\pi \in \Delta_L, \, \mathrm{KL}(\pi, p_{emp}) \leq r} \mathbb{E}_\pi[\ell(x, y, \theta)] = \operatorname*{argmax}_{\pi \in \Delta_L} \mathbb{E}_\pi[\ell(x, y, \theta)] + \min\{0, \gamma_c (r - \mathrm{KL}(\pi, p_{emp}))\}.$$

Motivated by this, we may thus transform the objective (3) into

$$\min_\theta \max_{\pi \in \Delta_L} \mathbb{E}_\pi[\ell(x, y, \theta)] + \min\{0, \gamma_c (r - \mathrm{KL}(\pi, p_{emp}))\}, \quad (4)$$

where $\gamma_c > 0$ is a sufficiently large constant; in practice, this may be chosen by a bisection search. The advantage of this formulation is that it admits an efficient update for π, as we now discuss.
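The penalised objective of Proposition 1 is straightforward to evaluate; a minimal sketch (function names are ours) that makes the behaviour of the penalty term explicit:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between label distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def adversary_objective(pi, class_losses, p_emp, r, gamma_c):
    """E_pi[loss] plus min{0, gamma_c * (r - KL(pi, p_emp))}: the penalty
    vanishes inside the KL ball of radius r and grows (negatively) with
    gamma_c outside it, discouraging infeasible pi."""
    penalty = min(0.0, gamma_c * (r - kl(pi, p_emp)))
    return float(pi @ class_losses) + penalty
```

For sufficiently large `gamma_c`, maximising this unconstrained objective over the simplex recovers the constrained maximiser, which is what lets us drop the projection step.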

3.3. ADVERSARIAL DISTRIBUTION UPDATES

We now detail how we can employ proximal mirror descent to efficiently update π. Observe that we may decompose the adversary's (negated) objective into two terms: $f(\theta, \pi) := -\mathbb{E}_\pi[\ell(x, y, \theta)]$ and $h(\pi) := \max\{0, \gamma_c(\mathrm{KL}(\pi, p_{emp}) - r)\}$, where $h(\pi)$ is independent of the samples. Such decomposable objectives suggest using proximal updates (Combettes & Pesquet, 2011):

$$\pi_{t+1} = \mathrm{prox}_{\lambda h}(\pi_t - \lambda \nabla_\pi f(\theta_t, \pi_t)) := \operatorname*{argmin}_{\pi \in \Delta_L} h(\pi) + \frac{1}{2\lambda}\left(\|\pi_t - \pi\|^2 + 2\lambda \langle \nabla_\pi f(\theta_t, \pi_t), \pi \rangle\right), \quad (5)$$

where λ serves as the learning rate. The value of proximal descent relies on the ability to efficiently solve the minimisation problem in (5). Unfortunately, this does not hold as-is for our choice of $h(\pi)$, essentially due to a mismatch between the use of KL-divergence in $h$ and the Euclidean distance $\|\pi_t - \pi\|^2$ in (5). Motivated by the advantages of mirror descent over gradient descent on the simplex (Bubeck, 2014), we propose to replace the Euclidean distance with the KL-divergence:

$$\pi_{t+1} = \operatorname*{argmin}_{\pi \in \Delta_L} h(\pi) + \frac{1}{2\lambda}\left(\mathrm{KL}(\pi, \pi_t) + 2\lambda \langle g_t, \pi \rangle\right), \quad (6)$$

where $g_t$ is an unbiased estimator of $\nabla_\pi f(\theta_t, \pi_t)$. We have the following closed-form update.

Lemma 2. Assume the optimal solution $\pi_{t+1}$ to (6) satisfies $\mathrm{KL}(\pi_{t+1}, p_{emp}) \neq r$, and that all classes appear at least once in the empirical distribution, i.e. $\forall i, \, p^i_{emp} > 0$. Let $\gamma = \gamma_c$ if $r < \mathrm{KL}(\pi_{t+1}, p_{emp})$, and $\gamma = 0$ if $r > \mathrm{KL}(\pi_{t+1}, p_{emp})$. Then $\pi_{t+1}$ permits the closed-form solution

$$\pi_{t+1} = (\pi_t \odot p_{emp}^\alpha)^{1/(1+\alpha)} \odot \exp(\eta_\pi g_t) / C,$$

where $\eta_\pi = \frac{1}{(\gamma + 1/2\lambda)(1+\alpha)}$, $\alpha = 2\gamma\lambda$, the constant $C = \|(\pi_t \odot p_{emp}^\alpha)^{1/(1+\alpha)} \odot \exp(\eta_\pi g_t)\|_1$ projects $\pi_{t+1}$ onto the simplex, and $a \odot b$ is the element-wise product between two vectors $a, b$.

In Algorithm 1, we set $\gamma = \gamma_c$ if $r < \mathrm{KL}(\pi_t, p_{emp})$ and 0 otherwise to approximate the true γ. Such an approximation works well when $r - \mathrm{KL}(\pi_t, p_{emp})$ does not change sign frequently.
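The closed-form update of Lemma 2, with the γ-approximation described above, can be sketched in a few lines of NumPy. This is our own illustrative rendering, taking `g_t` to be the adversary's ascent direction (in the label shift setting, an estimate of the per-class losses):

```python
import numpy as np

def mirror_update(pi_t, g_t, p_emp, r, gamma_c, lam):
    """One closed-form proximal mirror ascent step for the adversary,
    following the shape of Lemma 2. gamma is approximated from the
    current iterate: gamma_c if the KL constraint is violated, else 0."""
    kl = float(np.sum(pi_t * np.log(pi_t / p_emp)))
    gamma = gamma_c if kl > r else 0.0
    alpha = 2.0 * gamma * lam
    eta = 1.0 / ((gamma + 1.0 / (2.0 * lam)) * (1.0 + alpha))
    # Geometric interpolation with p_emp (active only when gamma > 0),
    # then an exponentiated-gradient step on g_t.
    unnorm = (pi_t * p_emp ** alpha) ** (1.0 / (1.0 + alpha)) * np.exp(eta * g_t)
    return unnorm / unnorm.sum()  # the constant C renormalises onto the simplex
```

Note that when the penalty is inactive (γ = 0, hence α = 0), the update reduces to a standard exponentiated-gradient step; when it is active, the extra factor pulls the iterate back towards $p_{emp}$.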

3.4. CONVERGENCE ANALYSIS

We provide below a convergence analysis of our gradient descent-proximal mirror ascent method for nonconvex-concave stochastic saddle point problems. For the composite objective $\min_\theta \max_{\pi \in \Delta_L} f(\theta, \pi) + h(\pi)$ and fixed learning rate $\eta_\theta$, we abstract the Algorithm 1 update as:

$$\theta_{t+1} = \theta_t - \eta_\theta g(\theta_t), \qquad \pi_{t+1} = \operatorname*{argmax}_{\pi} h(\pi) - \frac{1}{2\lambda}\left(\mathrm{KL}(\pi, \pi_t) + 2\lambda \langle g(\pi_t), \pi \rangle\right), \quad (8)$$

where $g(\pi), g(\theta)$ are stochastic gradients assumed to satisfy the following.

Assumption 1. The stochastic gradient $g(\theta)$ with respect to θ satisfies, for some $\sigma > 0$, $\mathbb{E}[g(\theta)] = \nabla_\theta f(\theta, \pi)$ and $\mathbb{E}[\|g(\theta) - \mathbb{E}[g(\theta)]\|^2] \leq \sigma^2$.

Assumption 2. The stochastic gradient $g(\pi)$ with respect to π satisfies, for some $G > 0$, $\mathbb{E}[g(\pi)] = \nabla_\pi f(\theta, \pi)$ and $\mathbb{E}[\|g(\pi)\|_\infty^2] \leq G^2$.

We make the following assumptions about the objective, similar to Lin et al. (2019; 2020):

Assumption 3. $f(\theta, \pi) + h(\pi)$ is L-smooth and l-Lipschitz; $f(\theta, \pi)$ and $h(\pi)$ are concave in π.

Assumption 4. Every adversarial distribution iterate $\pi_t$ satisfies $\mathrm{KL}(\pi_t, p_{emp}) \leq R$ for some $R > 0$.

Assumptions 3 and 4 may be enforced by adding a constant to the adversarial updates, which prevents $\pi_t$ from approaching the boundary of the simplex. Assumption 2 in the label shift setting implies that the loss is upper and lower bounded. Such an assumption may be enforced by clipping the loss for computing the adversarial gradient, which can significantly speed up training (see Appendix ??). Furthermore, this is a standard assumption for analysing nonconvex-concave problems (Lin et al., 2019). The assumption that the squared $L_\infty$ norm is bounded is weaker than the $L_2$ norm being bounded; this relaxation results from using a mirror rather than a Euclidean update.

Given that the function $F(\theta) := \max_{\pi \in \Delta_L} f(\theta, \pi) + h(\pi)$ is nonconvex, our goal is to find a stationary point instead of approximating a global optimum. Yet, due to the minimax formulation, the function $F(\theta)$ may not be differentiable.
Hence, we define convergence following recent works (Davis & Drusvyatskiy, 2019; Lin et al., 2019; Thekumparampil et al., 2019) on nonconvex-concave optimisation. First, Assumption 3 implies $F(\theta)$ is L-weakly convex and l-Lipschitz (Lin et al., 2019, Lemma 4.7). Hence, we define stationarity in the language of weakly convex functions.

Definition 1. A point θ is an ε-stationary point of a weakly convex function F if $\|\nabla F_{1/2L}(\theta)\| \leq \varepsilon$, where $F_{1/2L}$ denotes the Moreau envelope $F_{1/2L}(\theta) = \min_w F(w) + L \|w - \theta\|^2$.

With the above definition, we can establish convergence of the update (8):

Theorem 3 (informal). Under Assumptions 1-4, the update in (8) finds a point θ with $\mathbb{E}[\|\nabla F_{1/2L}(\theta)\|] \leq \varepsilon$ in $O(\varepsilon^{-8})$ iterations.

For a precise statement of the theorem, please see Appendix H. The above result matches the best known rate in Lin et al. (2019) for optimising nonconvex-concave problems with stochastic gradients. To our knowledge, this is the first result to study convergence of composite objectives with proximal methods in the nonconvex-concave setting. By utilising the proximal operator, it solves an objective with the extra $h(\pi)$ term without incurring additional complexity cost.

3.5. CLIPPING AND REGULARISING FOR FASTER CONVERGENCE

In addition to the proposed algorithm, we apply two further techniques, which we motivate here. First, we observe that the adversary's update can be very sensitive to the adversarial gradient $g_t$ (i.e., the label-wise loss in each minibatch), because the gradient appears in the exponential of the update. To avoid convergence degradation resulting from the noise in $g_t$, we clip the label-wise loss at the value 2. Second, we note that the KL divergence from any interior point of the simplex to its boundary is infinite. Hence, updates near the boundary can be highly unstable due to the nonsmooth KL term. To cope with this, we add a small constant to the adversarial distribution to prevent it from reaching any of the vertices of the simplex. Both the added constant and the clipping are critical in training and in the convergence analysis. We conduct an ablation of the sensitivity to these parameters in Figures 5 and 6. Note that the experiments show that even without these tricks, our proposed algorithm alone still outperforms the baselines.
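The two stabilisation tricks above can be sketched as follows; the helper names and the stabiliser value `eps=1e-3` are illustrative choices of ours:

```python
import numpy as np

def clip_losses(class_losses, threshold=2.0):
    """Clip the label-wise losses used in the adversarial gradient; the
    losses enter the update through an exponential, so clipping curbs
    noise amplification."""
    return np.minimum(class_losses, threshold)

def stabilise(pi, eps=1e-3):
    """Add a small constant and renormalise, keeping the adversarial
    distribution away from the simplex vertices, where KL(., p_emp)
    would blow up."""
    out = pi + eps
    return out / out.sum()
```

Applying `stabilise` after each adversarial step also enforces Assumption 4 of the convergence analysis, since the iterates stay a bounded KL distance from $p_{emp}$.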

3.6. DISCUSSION AND COMPARISON TO EXISTING ALGORITHMS

A number of existing learning paradigms (e.g., fairness, adversarial training, and domain adaptation) have connections to the problem of adversarial label shift; see Appendix A for details. We comment on some key differences between ADVSHIFT and related techniques in the literature. For the problem of minimising the worst-case loss (2) -- which is equivalent to setting the radius r = +∞ in (3) -- Sagawa et al. (2020) propose an algorithm that assumes the ability to sample data from a given group in order to evaluate adversarial gradients. Such sampling is cumbersome to implement in most ImageNet data loading pipelines. Mohri et al. (2019) propose a way to evaluate gradients using importance sampling, and then apply projected gradient descent-ascent. This method suffers from instability owing to sampling (upon which we improve with proximal updates), and incurs a non-trivial computational overhead due to the projection step; we illustrate these issues empirically in Section 4. Finally, Curi et al. (2019) provide an algorithm that updates weights using EXP3. This approach relies on a determinantal point process, which has a poor dimension-dependence.

4. EXPERIMENTAL RESULTS

We now present a series of experiments evaluating the performance of the proposed ADVSHIFT algorithm and comparing it to related approaches from the literature. We first explain our experimental setup and evaluation methodology. We then present results on the ImageNet dataset, showing that under the adversarial validation setting, our proposed algorithm significantly outperforms the other methods discussed in Table 1. Similar results on CIFAR-100 are given in the Appendix.

4.1. EXPERIMENTAL SETUP

To evaluate the proposed method, we use the standard image classification setup of training a ResNet-50 on ImageNet, using SGD with momentum as the neural network optimiser. All algorithms are run for 90 epochs, and are found to take almost the same wall-clock time. Note that ImageNet has a largely balanced training label distribution, and a perfectly balanced validation label distribution. We assess the performance of models under adversarial label shift as follows. First, we train a model on the training set and compute its error distribution on the validation set. Next, we pick a threshold τ on the allowable KL divergence between the train and target distributions, and find the adversarial distribution within this threshold that achieves the worst possible validation error. Finally, we compute the validation performance under this distribution. Note that τ = 0 corresponds to the train distribution, while τ = +∞ corresponds to the worst-case label distribution (see Figure 1). We evaluate the following methods, each corresponding to one row in Table 1: (i) standard empirical risk minimisation (BASELINE); (ii) balanced empirical risk minimisation (BALANCED); (iii) the agnostic federated learning algorithm of Mohri et al. (2019), which minimises the worst-case loss (AGNOSTIC); (iv) our proposed KL-divergence based algorithm, for various choices of adversarial radius r (ADVSHIFT); (v) training with ADVSHIFT using a fixed adversarial distribution extracted from Figure 3(c) (FIXED). This corresponds to the estimated test distribution row in Table 1 with an ideal estimator.
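The adversarial evaluation above, maximising validation error over label distributions within a KL ball of radius τ, can be computed in closed form up to a scalar search: the maximiser of a linear objective over a KL ball is an exponential tilt of the base distribution, π ∝ p · exp(err/T), and we bisect on the temperature T until the constraint is tight. This is our own sketch of such an evaluator, not the paper's code:

```python
import numpy as np

def worst_case_error(err, p, tau, iters=100):
    """Worst validation error over label distributions pi with
    KL(pi, p) <= tau, via bisection on the tilt temperature T."""
    def tilt(T):
        w = p * np.exp((err - err.max()) / T)  # shift exponent for stability
        return w / w.sum()

    def kl(q):
        mask = q > 0
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        T = np.sqrt(lo * hi)  # bisect in log space
        if kl(tilt(T)) > tau:
            lo = T  # too concentrated: raise the temperature
        else:
            hi = T
    return float(tilt(hi) @ err)  # tilt(hi) is guaranteed feasible
```

At τ = 0 this recovers the error under the base distribution, and for large τ it approaches the error of the single worst-performing class, matching the two extremes described above.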

4.2. RESULTS AND DISCUSSION

Figure 2 shows the train and validation performance on ImageNet. Each curve represents the average and standard deviation across 10 independent trials. To better illustrate the differences amongst methods, we plot the difference in error relative to the BASELINE method (see Figure 8 in the appendix for unnormalised results). Hyperparameters for each method are separately tuned. FIXED 1, 2, 3 correspond to training with each of the three adversarial distributions in Figure 3(c). We see that:

• the reduction offered by ADVSHIFT is consistently superior to that afforded by the AGNOSTIC, BALANCED and FIXED methods. On the training set, we observe a significant (∼ 8%) reduction in error for large KL divergence thresholds. On the validation set, the gains are less pronounced (∼ 2.5%), indicating some degradation due to a generalisation gap.

• while ADVSHIFT consistently improves over the baseline across adversarial radii, we observe the best performance for r = 0.1. Smaller values of r lead to smaller improvements, while training becomes increasingly unstable for larger radii; see the discussion in the final section.

• during training, AGNOSTIC either learns the adversarial distribution too slowly (so that it behaves like ERM), or uses too large a learning rate for the adversary (so that training fails). This highlights the importance of the proximal mirror ascent updates in our algorithm.

Illustration of distributions at fixed KL thresholds. Figure 3(c) visualises the adversarial distributions corresponding to a few values of the KL threshold τ.
At a threshold of τ = 3, the adversarial distribution is concentrated on only a few hard labels. Consequently, the resulting performance on such distributions is highly reflective of the worst-case distribution that can occur in practice.

Training with a fixed adversarial distribution. Suppose we take the final adversarial distributions shown in Figure 3(c), and then employ them as fixed distributions during training; this corresponds to the specified a-priori and estimated test distribution approaches in Table 1. Does the resulting model similarly reduce the error on hard classes? Surprisingly, Figure 2(d) indicates this is not so, and performance is in fact significantly worse on the "easy" classes. Employing a fixed adversarial distribution may thus lead to underfitting, which has an intuitive explanation: the model must struggle to fit difficult patterns from the early stages of training. Similar issues with importance weighting in conjunction with neural networks have been reported by Byrd & Lipton (2019).

Evolution of error distributions. To dissect the evolution of performance during training, Figure 3 shows violin plots of the distribution of errors for both the BASELINE and ADVSHIFT methods after fixed training epochs. We observe that on the training set, ADVSHIFT significantly reduces the worst-case error, as evidenced by the upper endpoints of the distribution being reduced. Note also that, as expected, the adversarial algorithm is slower to reduce the error on the "easy" classes early in training, as evidenced by the lower endpoints of the distribution initially taking higher values.
On the validation set, the reduction is consistent, albeit less pronounced owing to a generalisation gap.

Evolution of learned adversarial weights. To understand the evolution of the adversarial distribution across training epochs, Figure 4 plots the histogram of adversary weights at fixed training epochs. Starting off from a uniform distribution, the adversary is seen to quickly infer the relative difficulty of a small fraction of labels, assigning ∼ 2× the weight to them compared to the average. In subsequent iterations the distribution becomes more concentrated, and gradually reduces the largest weights.

Ablation of clipping threshold and gradient stabiliser. Figures 5 and 6 show an ablation of the choice of loss clipping threshold, and of the gradient stabiliser, the constant added to the adversarial updates to prevent the iterates from reaching the vertices of the simplex. We see that when the clipping threshold is either too large or too small, validation performance of the model tends to suffer (albeit remaining better than the baseline). Similarly, without any gradient stabilisation, the model's performance rapidly degrades as the adversarial radius increases; conversely, performance also suffers when the stabilisation is too high.

In summary, our experiments show that our proposed DRO formulation can be effectively solved with ADVSHIFT, and results in a model that is robust to adversarial label shift.

5. DISCUSSION AND FUTURE WORK

We proposed ADVSHIFT, an algorithm for coping with label shift based on distributionally robust optimisation, and illustrated its effectiveness on real-world datasets. Despite this, our approach does not solve the problem fully. First, Figure 2(a)(b) shows that the generalisation gap increases as the perturbation radius increases. Understanding why there is a correlation between hard examples and poor generalisation could further improve robustness. Second, Figure 2(a) shows that even on the train set, the algorithm's radius parameter r does not directly translate into the model's level of robustness. We conjecture this results from the interplay of model expressivity and data distribution, whose future study is of interest.

A RELATED PROBLEMS

Example-level DRO. Existing work on DRO has largely focussed on the setting where P encompasses shifts in the instance space (Namkoong & Duchi, 2016; 2017; Sinha et al., 2018; Duchi & Namkoong, 2018; Levy et al., 2020). This notion of robustness has a natural link with adversarial training (Sinha et al., 2017), and involves a more challenging problem, as it requires parameterising the adversary's distribution. Hu et al. (2018) illustrate the potential pitfalls of DRO owing to a mismatch between surrogate and 0-1 losses. They also propose to encode an uncertainty set based on latent label distribution shift (Storkey & Sugiyama, 2007), which requires domain knowledge. Techniques for example-level DRO are mostly designed for small-scale datasets with SVM models: they require sampling according to the adversarial distribution, which can be very unstable if implemented with importance sampling alone. They also require maintaining a weight vector proportional in size to the number of training samples and indexing each sample during training to match up the sample index, which is not available in most data loading pipelines.

Fairness. Adversarial label shift may be related to algorithmic fairness. Abstractly, this concerns the mitigation of systematic bias in predictions on sensitive subgroups (e.g., country of origin). One fairness criterion posits that the per-subgroup errors should be equal (Zafar et al., 2017; Donini et al., 2018), an ideal that may be targeted by minimising the worst-subgroup error (Mohri et al., 2019; Sagawa et al., 2020). When the subgroups correspond to labels, ensuring this notion of fairness is tantamount to guarding against an adversary that can place all mass on the worst-performing label.

GANs.
GANs (Goodfellow et al., 2014) involve solving a min-max objective that bears some similarity to the DRO formulation (3), but is fundamentally different in details: while DRO considers reweighting of samples according to a fixed family, GANs involve a parameterised adversarial family, with the training objective augmented with an additional penalty. Domain adaptation. Label shift can be viewed as a special case of domain adaptation, where p tr and p te can systematically differ. Typically, one assumes access to a small sample from p te , which may be used to estimate importance weights (Combes et al., 2020) , or samples from multiple domains, which may be used to estimate a generic domain-agnostic representation (Muandet et al., 2013) . In causal inference, there has been interest in similar classes of models (Arjovsky et al., 2019) .

B ALGORITHM IMPLEMENTATION DETAILS

We introduce some additional details of our implementation of ADVSHIFT. First, as observed in Section 3.1, our algorithm requires knowing the empirical label distribution. As the exact value is not always available, we estimate the empirical label distribution online, for all experiments presented in Section 4, using an exponential moving average p_emp ← β · p_emp + (1 − β) · p_batch, where p_batch is the label distribution in the minibatch. We set β = 0.999; this value is chosen so that the exponential moving average has a half-life roughly equal to the number of iterations in one epoch of ImageNet training in our setup. In all experiments, we set 2γ_c λ = 1 in Algorithm 1 for simplicity. For learning the adversarial distribution, we only tune the adversarial learning rate η_π.
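The online estimate above can be sketched in a few lines of NumPy; this is a minimal illustration (the function name and batch handling are our own, not taken from the released implementation):

```python
import numpy as np

def update_empirical_label_dist(p_emp, batch_labels, num_classes, beta=0.999):
    """One exponential-moving-average update of the empirical label distribution.

    p_emp: current running estimate, shape (num_classes,).
    batch_labels: integer class labels in the current minibatch.
    """
    counts = np.bincount(batch_labels, minlength=num_classes)
    p_batch = counts / counts.sum()          # label distribution of this minibatch
    return beta * p_emp + (1.0 - beta) * p_batch

# Start from a uniform estimate and refine it with each minibatch.
p_emp = np.full(10, 1.0 / 10)
rng = np.random.default_rng(0)
for _ in range(100):
    batch = rng.integers(0, 10, size=256)
    p_emp = update_empirical_label_dist(p_emp, batch, num_classes=10)
```

Since each update mixes in a valid probability vector, the estimate remains a distribution throughout training.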

C ADDITIONAL EXPERIMENTAL RESULTS

We present here additional experimental results, including:
• for ImageNet, an illustration of the lack of correlation between a label's frequency in the training set and its validation error (Figure 7);
• unnormalised versions of the results on ImageNet shown in the body, where we do not subtract the baseline performance from each of the curves; this gives a sense of the absolute performance numbers obtained by each method (Figure 8);
• an ablation of the loss clipping threshold and gradient stabiliser introduced above (Figures 5, 6);
• results on CIFAR-100, to complement those for ImageNet (Figures 9, 10).

For each plot, we vary a KL divergence threshold τ, and for a given τ construct the label distribution which results in maximal test error for the baseline model. We then compute the test error under this distribution. Note that the case τ = 0 corresponds to using the train distribution, while τ = +∞ corresponds to using the worst-case label distribution, which is concentrated on the worst-performing label. Our proposed ADVSHIFT can reduce the adversarial test error by over ∼ 2.5% compared to the baseline method.

C.1 BALANCED LABELS =⇒ BALANCED PERFORMANCE

Figure 7 shows that training label frequency does not strongly correlate with test error. Observe that several classes with a high error appear frequently in the training set. Indeed, the three classes with the highest error (cassette player, maillot, and water jug) all appear an equal number of times in the training set.

C.2 UNNORMALISED PLOTS ON IMAGENET

Figure 8 presents plots of the unnormalised performance of the various methods compared in the body. Here, rather than subtract the performance of the baseline, we show the absolute accuracy of each method as the adversarial radius is varied. Evidently, the baseline and AGNOSTIC models tend to suffer in their validation error as the adversarial radius increases.

C.3 RESULTS ON CIFAR-100

Figure 9 shows results on CIFAR-100, where we train various methods using a CIFAR-ResNet-18 as the underlying architecture. Here, we see a consistent and sizable improvement from ADVSHIFT over the baseline method. On this dataset, AGNOSTIC fares better, and eventually matches the performance of ADVSHIFT at a large adversarial radius. This is in keeping with the intended use-case of AGNOSTIC, i.e., minimising the worst-case loss. Figure 10 supplements these plots with unnormalised versions, to illustrate the absolute performance differences.

E PROJECTING AN ADVERSARIAL DISTRIBUTION

The projection operator in our setting aims to project a distribution p onto the set P = {q : KL(q, p_emp) ≤ r} by solving the following problem:

min_q ‖q − p‖²  subject to  Σ_i q_i log(q_i / p_emp(i)) ≤ r,  Σ_i q_i = 1,  q_i ≥ 0 for all i,

where q_i and p_emp(i) denote the i-th components of q and p_emp, and n denotes the number of classes. Given that our implementation is based on TensorFlow, we use the trust-region constrained algorithm provided by SciPy for easy integration with our Python-based training procedure. However, even after extensive tuning, solving each problem to within 1% relative constraint error requires more than 1 minute when n = 1000 (the number of labels in ImageNet). This means that if we train ResNet-50 on ImageNet for 100k iterations, we would need to spend 100k minutes on the projection operation alone, which is unaffordable.
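A sketch of this generic-solver approach (our own code, using SciPy's `trust-constr` method on a small example; the exact solver settings used in our experiments may differ) illustrates why each projection is expensive:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint, LinearConstraint

def project_kl_ball(p, p_emp, r):
    """Euclidean projection of p onto {q : KL(q, p_emp) <= r} via a generic
    trust-region constrained solver. Slow for large n, as noted above."""
    n = len(p)
    kl = lambda q: float(np.sum(q * np.log(np.maximum(q, 1e-12) / p_emp)))
    constraints = [
        NonlinearConstraint(kl, -np.inf, r),          # KL-ball constraint
        LinearConstraint(np.ones(n), 1.0, 1.0),       # simplex: sum to one
    ]
    res = minimize(lambda q: float(np.sum((q - p) ** 2)),
                   x0=p_emp,                           # feasible starting point
                   bounds=[(1e-12, 1.0)] * n,
                   constraints=constraints,
                   method='trust-constr')
    return res.x

p_emp = np.full(5, 0.2)
p = np.array([0.05, 0.05, 0.1, 0.1, 0.7])   # far outside the KL ball
q = project_kl_ball(p, p_emp, r=0.05)
```

Even at n = 5 the solver takes many iterations; at n = 1000 the per-call cost dominates training, motivating the penalised reformulation used by ADVSHIFT.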

F PROOF OF PROPOSITION 1

Proof. We only need to show that, for large enough γ_c, any minimiser p* of the unconstrained problem satisfies KL(p*, p_emp) ≤ r. Since, in the KL geometry, the distance from the boundary of the simplex to any interior point is +∞, we can safely assume that the minimiser lies in the relative interior of the simplex.

G PROOF OF LEMMA 2

Recall that we want to find π_{k+1} minimising the following objective:

π_{k+1} = argmin_{π∈∆} h(π) + (1/2λ)(KL(π, π_k) + 2λ⟨g_k, π⟩)
        = argmin_{π∈∆} max(0, (α_c/(1+α_c))(KL(π, p_emp) − r)) + (1/(1+α_c))(KL(π, π_k) − η⟨g_k, π⟩),

where η = 1/(2γ_c + 1/λ) and α_c = 2γ_c λ. Denote by v(i) the i-th component of a vector v. Notice that the simplex can be written via the constraints Σ_i π(i) = 1 and π(i) ≥ 0 for all i. Based on these constraints, we first write the Lagrangian dual of (6):

L(a, b, π) = Σ_i a_i π(i) + b(Σ_i π(i) − 1) + max(0, (α_c/(1+α_c))(KL(π, p_emp) − r)) + (1/(1+α_c))(KL(π, π_k) − η⟨g_k, π⟩).

If π_k > 0 component-wise, then the optimal π cannot lie on the boundary (i.e., π(i) > 0 for all i), as this would result in KL(π, π_k) = ∞. By Lagrangian duality and complementary slackness, if KL(π, p_emp) > r, then

0 = ∂L/∂π(i) = b + (α_c/(1+α_c)) log(π(i)/p_emp(i)) + (1/(1+α_c))(log(π(i)/π_k(i)) − η g_k(i)) + 1.

On the other hand, if KL(π, p_emp) < r, then

0 = ∂L/∂π(i) = b + (1/(1+α_c))(log(π(i)/π_k(i)) − η g_k(i)) + 1.

We discuss the case KL(π, p_emp) > r; the other case follows similarly. Rearranging the optimality condition, we get

(α_c/(1+α_c)) log(π(i)/p_emp(i)) + (1/(1+α_c))(log(π(i)/π_k(i)) − η g_k(i)) = −b − 1.

Since b is a constant across all coordinates,

π(i) ∝ (p_emp(i)^{α_c} π_k(i))^{1/(1+α_c)} exp(η g_k(i)/(1+α_c)).

The result follows by noting that Σ_i π(i) = 1.
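The closed-form update above can be implemented in a few lines; the following NumPy sketch (with names of our own choosing) computes it in log space for numerical stability:

```python
import numpy as np

def mirror_ascent_update(pi_k, g_k, p_emp, eta, alpha_c):
    """Closed-form proximal mirror-ascent step derived in Lemma 2:
    pi(i) ∝ (p_emp(i)^alpha_c * pi_k(i))^(1/(1+alpha_c)) * exp(eta*g_k(i)/(1+alpha_c))."""
    log_pi = (alpha_c * np.log(p_emp) + np.log(pi_k) + eta * g_k) / (1.0 + alpha_c)
    log_pi -= log_pi.max()              # shift to avoid overflow in exp
    pi = np.exp(log_pi)
    return pi / pi.sum()                # normalise back onto the simplex

pi_k = np.full(4, 0.25)                 # current adversarial distribution
p_emp = np.array([0.4, 0.3, 0.2, 0.1])  # empirical label distribution
g_k = np.array([0.0, 1.0, 2.0, 3.0])    # per-class losses as the ascent gradient
pi_next = mirror_ascent_update(pi_k, g_k, p_emp, eta=0.5, alpha_c=1.0)
```

Note that with eta = 0 and alpha_c = 0 the update reduces to the identity, as the closed form predicts, and larger per-class losses receive larger adversarial weight.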

H PROOF OF THEOREM 3

For completeness, we define several terms used in optimisation. A function f(θ) is l-Lipschitz if, for all θ, θ′, |f(θ) − f(θ′)| ≤ l‖θ − θ′‖. A function f(θ) is L-smooth if, for all θ, θ′, ‖∇f(θ) − ∇f(θ′)‖ ≤ L‖θ − θ′‖. A function f(θ) is L-weakly convex if f(θ) + (L/2)‖θ‖² is convex. We can then state the formal theorem below.

Theorem 4 (formal version of Theorem 3). Under Assumptions 1-4, the update in (8) generates a sequence of points θ_1, ..., θ_T with the following property:

(1/T) Σ_t E[‖∇F_{1/2L}(θ_t)‖²] ≤ (1/T^{1/4}) (2L(F_{1/2L}(θ_0) − F*_{1/2L}) + 2G + G² + R² + (l² + σ²)^{1/2}) + (h* − h(π_0))/T.

Proof. For convenience, denote Φ(θ, π) = f(θ, π) + h(π) and F(θ) = max_π Φ(θ, π). We start by following the standard SGD proof. Denote by g_θ the stochastic gradient with respect to θ evaluated at step t−1, and let θ̂ = prox_{F/2L}(θ) := argmin_w {F(w) + 2L‖w − θ‖²}. Conditioned on θ_{t−1}, we have

E[‖θ̂_{t−1} − θ_t‖²] = ‖θ̂_{t−1} − θ_{t−1}‖² + 2η_θ E[⟨θ̂_{t−1} − θ_{t−1}, g_θ⟩] + η_θ² E[‖g_θ‖²]
                    ≤ ‖θ̂_{t−1} − θ_{t−1}‖² + 2η_θ ⟨θ̂_{t−1} − θ_{t−1}, ∇_θ Φ(θ_{t−1}, π_{t−1})⟩ + η_θ² (l² + σ²)   (9)

where the first equality follows from the update θ_t = θ_{t−1} − η_θ g_θ with E[g_θ] = ∇_θ Φ(θ_{t−1}, π_{t−1}). Next, we observe that ⟨θ̂_{t−1} − θ_{t−1}, ∇_θ Φ(θ_{t−1}, π_{t−1})⟩ ≤ Φ(θ̂_{t−1}, π_{t−1}) − Φ(θ_{t−1}, π_{t−1}) + (L/2)‖θ̂_{t−1} − θ_{t−1}‖², and hence

F_{1/2L}(θ_t) ≤ F(θ̂_{t−1}) + L(‖θ̂_{t−1} − θ_{t−1}‖² + 2η_θ ⟨θ̂_{t−1} − θ_{t−1}, ∇_θ Φ(θ_{t−1}, π_{t−1})⟩ + η_θ² (l² + σ²))
              ≤ F_{1/2L}(θ_{t−1}) + 2Lη_θ (Φ(θ̂_{t−1}, π_{t−1}) − Φ(θ_{t−1}, π_{t−1}) − (L/2)‖θ̂_{t−1} − θ_{t−1}‖²) + Lη_θ² (l² + σ²).   (11)

The second line substitutes in (9); the third line follows by convexity and L-smoothness. Denote ∆_t := F(θ̂_{t−1}) − Φ(θ_{t−1}, π_{t−1}) ≥ Φ(θ̂_{t−1}, π_{t−1}) − Φ(θ_{t−1}, π_{t−1}). We can sum over t and take expectations recursively to get

Σ_t E[‖∇F_{1/2L}(θ_t)‖²] = 2L Σ_t E[‖θ̂_{t−1} − θ_{t−1}‖²]   (12)
                        ≤ (2/(Lη_θ))(F_{1/2L}(θ_0) − F*_{1/2L}) + 4 Σ_t ∆_t + T η_θ (l² + σ²),

where F*_{1/2L} = min_θ F_{1/2L}(θ). The first equality follows by the definition of the Moreau envelope; the second inequality follows by rearranging (11).
Next, we aim to bound the accumulated error Σ_t ∆_t. Recall that the update for π is, with E[g_π] = ∇_π f(θ, π),

π_{k+1} := argmin_{π∈∆} {−2λ⟨g_π, π⟩ − 2λh(π) + KL(π, π_k)}.

Applying Lemma 5 with ℓ(π) = −2λ⟨g_π, π⟩ − 2λh(π), we get the inequality below; its second step follows from the strong convexity of the KL divergence with respect to the L1 norm together with the Cauchy-Schwarz inequality. We further observe that

−E[⟨g_π, π_k − π*(θ_s)⟩] = −⟨∇_π f(θ_k, π_k), π_k − π*(θ_s)⟩
  ≥ −f(θ_k, π_k) + f(θ_k, π*(θ_s))
  = −f(θ_k, π_k) + f(θ_k, π*(θ_k)) − f(θ_k, π*(θ_k)) + f(θ_k, π*(θ_s))
  ≥ −f(θ_k, π_k) + f(θ_k, π*(θ_k)) − f(θ_k, π*(θ_k)) + f(θ_s, π*(θ_k)) − f(θ_s, π*(θ_s)) + f(θ_k, π*(θ_s))
  ≥ −f(θ_k, π_k) + f(θ_k, π*(θ_k)) − 2l‖θ_s − θ_k‖.



Figure 1: Distribution of per-class test errors of a ResNet-50 on ImageNet (left). While the average error rate is ∼ 24%, some classes achieve an error as high as ∼ 80%. An adversary can thus significantly degrade test performance (right) by choosing pte(y) with more weight on these classes.

Figure 2: Comparison of performance on ImageNet under adversarial label distributions. For each method, we vary the KL divergence threshold τ, and for each τ report the maximal validation error induced by an adversarial shift within the threshold. Subplots (a), (b) compare the performance of ADVSHIFT trained with different DRO radii r against default ERM training. We subtract the baseline error of ERM from all values for easy visualisation; absolute values can be found in Figure 8 in the Appendix. Combined with (c), (d), we see that ADVSHIFT can reduce the adversarial validation error by over ∼ 2.5% compared to the BASELINE method, and is consistently superior to the AGNOSTIC, BALANCED and FIXED methods. Figure 3(c) illustrates adversarial distributions for varying thresholds τ.

problems in our subsequent experiments (see results for AGNOSTIC in §4). Finally, for an uncertainty set P based on the CVaR, Curi et al. (2019) provide an algorithm that updates weights using EXP3. This approach relies on a determinantal point process, which has a poor dimension-dependence.

Figure 2 shows the train and validation performance on ImageNet. Each curve represents the average and standard deviation across 10 independent trials. To better illustrate the differences amongst methods, we plot the difference in error relative to the BASELINE method. (See Figure 8 in the Appendix for unnormalised plots.) Subfigures (a) and (b) compare the performance of ADVSHIFT for various choices of radius r to the ERM baseline; (c) and (d) compare ADVSHIFT to the remaining methods.

Figure 3: Subplots (a), (b) show violin plots of the distribution of errors for both the BASELINE and our ADVSHIFT methods over the course of training. On the training set, ADVSHIFT significantly reduces the worst-case error, as evidenced by the lower upper endpoints of the distribution. On the validation set, the reduction is consistent, albeit less pronounced owing to a generalisation gap. Subplot (c) illustrates adversarial distributions at KL distances of 1, 2 and 3 for a model trained with BASELINE. Even at τ = 1, the adversarial distribution is highly concentrated on only a few hard labels.

Figure 4: Evolution of the learned adversarial distribution (π) across training epochs. Starting from a uniform distribution over labels, the adversary quickly infers the relative difficulty of a small fraction of labels, assigning them nearly 2× the average weight. This distribution remains largely stable in subsequent iterations, becoming gradually more concentrated as training converges.

Figure 5: Ablation of the loss clipping threshold. We see that when the clipping threshold is either too large or too small, the validation performance of the model tends to suffer.

Figure 8: Comparison of the performance of various methods on ImageNet under adversarial label distributions. For each plot, we vary a KL divergence threshold τ, and for a given τ construct the label distribution which results in maximal test error for the baseline model. We then compute the test error under this distribution. Note that the case τ = 0 corresponds to using the train distribution, while τ = +∞ corresponds to using the worst-case label distribution, which is concentrated on the worst-performing label. Our proposed ADVSHIFT can reduce the adversarial test error by over ∼ 2.5% compared to the baseline method.

−2λ⟨g_π, π*(θ_s)⟩ − 2λh(π*(θ_s)) + KL(π*(θ_s), π_k) ≥ −2λ⟨g_π, π_{k+1}⟩ − 2λh(π_{k+1}) + KL(π_{k+1}, π_k) + KL(π*(θ_s), π_{k+1}).

Rearranging and taking expectations, we get

2λ(E[⟨−g_π, π_k − π*(θ_s)⟩] − E[h(π_{k+1})] + E[h(π*(θ_s))])
≤ −E[KL(π_{k+1}, π_k)] + E[KL(π*, π_{k+1})] − E[KL(π*, π_k)] + 2λ E[⟨g_π, π_k − π_{k+1}⟩]
≤ E[KL(π*, π_{k+1})] − E[KL(π*, π_k)] − ‖π_{k+1} − π_k‖₁²/2 + 2λ² E[‖g_π‖∞²] + ‖π_{k+1} − π_k‖₁²/2.

The first inequality follows by concavity; the third line follows from f(θ_s, π*(θ_k)) ≤ f(θ_s, π*(θ_s)); the last inequality follows by Lipschitzness. Similarly,

−h(π_k) + h(π*(θ_s)) ≥ −h(π_k) + h(π*(θ_k)) − 2l‖θ_s − θ_k‖.

Taking iterated expectations and summing over k = s+1, ..., s+B, we get

Σ_{k=s+1}^{s+B} E[−f(θ_k, π_k) + f(θ_k, π*(θ_k)) − h(π_k) + h(π*(θ_k))]
≤ −h(π_s) + E[h(π_{s+B})] + 4l Σ_{k=s+1}^{s+B} Σ_{j=s}^{k} E[‖θ_j − θ_{j+1}‖] + λBG² + (1/2λ)(E[KL(π*(θ_s), π_0)] − E[KL(π*(θ_s), π_s)]).

Algorithm 1 ADVSHIFT(θ 0 , γ c , λ, NNOpt, p emp , η π )

Figure 7: Illustration that training label frequency does not strongly correlate with test error. Observe that several classes with a high error appear frequently in the training set.

The first line follows by convexity and L-smoothness; the second line by the definition of F; the third line by the definition of θ̂. Next, by the definition of the Moreau envelope, F_{1/2L}(θ_t) ≤ F(θ̂_{t−1}) + L‖θ̂_{t−1} − θ_t‖².


We remark here that the choice of a CIFAR-ResNet-18 results in an underparameterised model, which does not perfectly fit the training data. In the overparameterised case, there are challenges with employing DRO, as noted by Sagawa et al. (2020) . Addressing these challenges in settings where the training data is balanced remains an interesting open question.

D CONSTRAINED DRO DOES NOT PERMIT A BOLTZMANN SOLUTION

We start with a simple example with three label classes {a, b, c}, with class losses l = {1, 2, 4} respectively. We assume a uniform empirical distribution, i.e., p_emp = {1/3, 1/3, 1/3}. We consider two different problems. The first is to find the optimal solution to the regularised objective. This problem is well known (e.g., see 2.7.2 of lecture ) to permit a solution of the form p(x) = exp(l_x/t) / Σ_{x′∈{a,b,c}} exp(l_{x′}/t) for some t. In contrast, we show that distributions of this form do not solve the constrained version of the problem. In particular, we consider maximising the expected loss E_p[l] subject to the constraint KL(p, p_emp) ≤ r. We solve this problem with a convex optimiser using r = 0.01 and find p_a = 0.283, p_b = 0.322, p_c = 0.395, for which

(log(p_a) − log(p_c)) / (log(p_b) − log(p_c)) = 1.64 ≠ (l_a − l_c) / (l_b − l_c) = 1.5.

The above example shows that not all solutions of the constrained problem can be written as a Boltzmann distribution, i.e., p(x) = exp(l_x/t) / Σ_{x′∈{a,b,c}} exp(l_{x′}/t). Yet, this does not contradict results [e.g., Lemma 4, Faury et al. (2020)] that claim there is a Boltzmann distribution whose function value matches the optimal value of the constrained problem: mathematically, we can have p ≠ p′ but E_p[l(x)] = E_{p′}[l(x)].

Note that ∆

By further summing over all blocks and dividing by the total number of iterations T, and then substituting the resulting inequality into (12), we obtain the claim.

Lemma 5. For any differentiable convex function ℓ, if x* = argmin_{x∈∆} {ℓ(x) + KL(x, x_0)}, then for any x′ ∈ ∆ we have

ℓ(x′) + KL(x′, x_0) ≥ ℓ(x*) + KL(x*, x_0) + KL(x′, x*).

Proof. This lemma is well known, but we include a proof for completeness. By the optimality of x* and the convexity of ∆, we know that ⟨∇ℓ(x*) + ∇φ(x*) − ∇φ(x_0), x′ − x*⟩ ≥ 0 for all x′ ∈ ∆, where φ(x) = Σ_i x_i log(x_i), and the Bregman divergence defined according to φ is the KL divergence. Then

ℓ(x′) ≥ ℓ(x*) + ⟨∇ℓ(x*), x′ − x*⟩
      ≥ ℓ(x*) + ⟨∇φ(x_0) − ∇φ(x*), x′ − x*⟩
      = ℓ(x*) + KL(x*, x_0) − KL(x′, x_0) + KL(x′, x*),

where the last equality is the three-point identity for the Bregman divergence. Rearranging gives the claim.

