COPING WITH LABEL SHIFT VIA DISTRIBUTIONALLY ROBUST OPTIMISATION

Abstract

The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.

1. INTRODUCTION

Classical supervised learning involves learning a model from a training distribution that generalises well on test samples drawn from the same distribution. While the assumption of identical train and test distributions has given rise to useful methods, it is often violated in many practical settings (Kouw & Loog, 2018). The label shift problem is one such important setting, wherein the training distribution over the labels does not reflect what is observed during testing (Saerens et al., 2002). For example, consider the problem of object detection in self-driving cars: a model trained in one city may see a vastly different distribution of pedestrians and cars when deployed in a different city. Such shifts in the label distribution can significantly degrade model performance. As a concrete example, consider the performance of a ResNet-50 model on ImageNet. While the overall error rate is ∼24%, Figure 1 reveals that certain classes suffer an error as high as ∼80%. Consequently, a label shift that increases the prevalence of the more erroneous classes in the test set can significantly degrade performance.

Most existing work on label shift operates in the setting where one has an unlabelled test sample that can be used to estimate the shifted label probabilities (du Plessis & Sugiyama, 2014; Lipton et al., 2018; Azizzadenesheli et al., 2019). One can then retrain a classifier using these probabilities in place of the training label probabilities. While such techniques have proven effective, it is not always feasible to access an unlabelled test set. Further, one may wish to deploy a learned model in multiple test environments, each with its own label distribution. For example, the label distribution for a vehicle detection camera may change continuously while driving across a city. Rather than deploying a separate model for each scenario, deploying a single model that is robust to shifts may be more efficient and practical.
Hence, we address the following question in this work: can we learn a single classifier that is robust to a family of arbitrary label shifts? We answer this question by modeling label shift via distributionally robust optimisation (DRO) (Shapiro et al., 2014; Rahimian & Mehrotra, 2019). DRO offers a convenient way of coping with distribution shift, and has led to successful applications (e.g., Faury et al., 2020). Intuitively, DRO seeks a model that performs well on all label distributions that are "close" to the training label distribution; this task can be cast as a game between the learner and an adversary, with the latter allowed to pick label distributions that maximise the learner's loss. We remark that while adversarial perspectives have informed popular paradigms such as GANs, these pursue fundamentally different objectives from DRO (see Appendix A for details). Although several previous works have explored DRO for tackling the problem of example shift (e.g., adversarial examples) (Namkoong & Duchi, 2016; 2017; Duchi & Namkoong, 2018), an application of DRO to the label shift setting poses several challenges: (a) naïvely updating the adversary's distribution requires solving a nontrivial convex optimisation subproblem with limited tractability, and also needs careful parameter tuning; and (b) naïvely estimating gradients under the adversarial distribution on a randomly sampled minibatch can lead to unstable behaviour (see §3.1). We overcome these challenges by proposing the first algorithm that successfully optimises a DRO objective for label shift on a large-scale dataset (i.e., ImageNet). Our objective encourages robustness to arbitrary label distribution shifts within a KL-divergence ball around the empirical label distribution. Importantly, we show that this choice of robustness set admits an efficient and stable update step.
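To make the KL-ball robustness set concrete: for fixed per-class losses, the inner maximisation over label distributions has a closed-form family of solutions, namely exponential tiltings of the training label distribution, with a temperature chosen so the tilted distribution sits on the boundary of the KL ball. The sketch below (not the paper's implementation; the function names and bisection bracket are our own) computes such a worst case numerically:

```python
import numpy as np

def tilted(p, losses, tau):
    """Exponentially tilted distribution q(y) proportional to p(y) * exp(losses[y] / tau)."""
    logits = np.log(p) + losses / tau
    logits -= logits.max()                      # numerical stability
    q = np.exp(logits)
    return q / q.sum()

def kl(q, p):
    """KL(q || p), treating 0 * log 0 as 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def worst_case_label_dist(p, losses, radius, lo=1e-3, hi=1e3, iters=60):
    """Approximate argmax_q <q, losses> subject to KL(q || p) <= radius.

    Smaller temperatures tau tilt harder and increase KL(q || p), so we
    bisect (in log space) for the smallest feasible temperature."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl(tilted(p, losses, mid), p) > radius:
            lo = mid        # tilt too aggressive: need a larger temperature
        else:
            hi = mid        # feasible: try tilting harder
    return tilted(p, losses, hi)
```

For example, with a uniform label distribution over five classes and a loss concentrated on one class, the returned distribution upweights that class exactly until the KL budget is exhausted.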

Summary of contributions

(1) We design a gradient descent-proximal mirror ascent algorithm tailored for optimising large-scale problems with minimal computational overhead, and prove its theoretical convergence.
(2) With the proposed algorithm, we implement a practical procedure that successfully optimises the robust objective at ImageNet scale for the label shift application.
(3) We show through experiments on ImageNet and CIFAR-100 that our technique significantly improves over baselines when the label distribution is adversarially varied.
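As a rough illustration of what an alternating descent-ascent scheme of this kind can look like, consider the toy sketch below. It is a hypothetical stand-in, not the paper's actual algorithm: the linear model, the step sizes `eta_w` and `eta_q`, and the KL-penalty weight `lam` are all invented here. The model parameters take a gradient descent step on the adversarially reweighted risk, while the adversary's label distribution takes an exponentiated-gradient (mirror ascent) step on a KL-regularised objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class data with an imbalanced training label distribution.
L, d, n = 3, 2, 600
y = rng.choice(L, size=n, p=[0.7, 0.2, 0.1])
X = rng.normal(size=(n, d)) + 3.0 * np.eye(L)[y][:, :d]  # class-dependent means

p_emp = np.bincount(y, minlength=L) / n
W = np.zeros((L, d))          # linear softmax model
q = p_emp.copy()              # adversary's label distribution
eta_w, eta_q, lam = 0.1, 0.5, 1.0

for step in range(300):
    # Per-class average losses under the current model.
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(n), y])
    class_loss = np.array([nll[y == c].mean() for c in range(L)])

    # Descent step on W: minimise the q-reweighted risk, using
    # importance weights q_y / p_emp_y on each example.
    w_ex = (q / p_emp)[y]
    grad_scores = (probs - np.eye(L)[y]) * w_ex[:, None] / n
    W -= eta_w * (grad_scores.T @ X)

    # Mirror-ascent step on q: maximise <q, class_loss> - lam * KL(q || p_emp)
    # via an exponentiated-gradient update, which stays on the simplex.
    g = class_loss - lam * (np.log(q / p_emp) + 1.0)
    q = q * np.exp(eta_q * g)
    q /= q.sum()
```

The KL penalty toward `p_emp` plays the role of the robustness radius: it keeps the adversary from drifting arbitrarily far from the empirical label distribution, while the multiplicative update keeps `q` strictly positive and normalised at every step.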

2. BACKGROUND AND PROBLEM FORMULATION

In this section, we formalise the label shift problem and motivate its formulation as an adversarial optimisation problem. The assumption underlying the empirical risk minimisation objective (1) below is that test samples are drawn from the same distribution p_tr that is used during training. However, this assumption is violated in many practical settings. The problem of learning from a training distribution p_tr, while attempting to perform well on a test distribution p_te ≠ p_tr, is referred to as domain adaptation (Ben-David et al., 2007).



Figure 1: Distribution of per-class test errors of a ResNet-50 on ImageNet (left). While the average error rate is ∼24%, some classes suffer an error as high as ∼80%. An adversary can thus significantly degrade test performance (right) by choosing p_te(y) to place more weight on these classes.

Consider a multiclass classification problem with distribution p_tr over instances X and labels Y = [L]. The goal is to learn a classifier h_θ : X → Y, parameterised by θ ∈ Θ, with the aim of ensuring good predictive performance on future samples drawn from p_tr. More formally, the goal is to minimise the objective

min_θ E_{(x,y)∼p_tr}[ℓ(x, y, θ)],     (1)

where ℓ : X × Y × Θ → R_+ is a loss function. In practice, we only have access to a finite sample S = {(x_i, y_i)}_{i=1}^n ∼ p_tr^n, which motivates us to use the empirical distribution p_emp(x, y) = (1/n) Σ_{i=1}^n 1(x = x_i, y = y_i) in place of p_tr. Doing so, we arrive at the objective of minimising the empirical risk:

min_θ E_{p_emp}[ℓ(x, y, θ)] := (1/n) Σ_{i=1}^n ℓ(x_i, y_i, θ).
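Since the risk decomposes over labels as E_p[ℓ] = Σ_y p(y) E[ℓ(x, y, θ) | y], a shift in the label marginal p(y) simply reweights the classifier's fixed per-class risks. A small numerical sketch makes this concrete (the per-class error rates and distributions below are invented for illustration, loosely echoing the Figure 1 pattern of a low average error with a few hard classes):

```python
import numpy as np

# Hypothetical per-class error rates of a fixed classifier.
per_class_err = np.array([0.10, 0.15, 0.20, 0.45, 0.80])
p_tr = np.array([0.30, 0.30, 0.20, 0.15, 0.05])  # training label distribution

risk_tr = p_tr @ per_class_err                   # risk when p_te == p_tr

# A label shift moving mass onto the hard classes.
p_te = np.array([0.10, 0.10, 0.10, 0.30, 0.40])
risk_te = p_te @ per_class_err                   # risk roughly doubles

print(risk_tr, risk_te)
```

Even though the classifier itself is unchanged, its risk under the shifted marginal is far higher, which is precisely the degradation an adversarial choice of p_te(y) exploits.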

