COPING WITH LABEL SHIFT VIA DISTRIBUTIONALLY ROBUST OPTIMISATION

Abstract

The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample, which may be used to estimate the test label distribution and then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited, as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.

1. INTRODUCTION

Classical supervised learning involves learning a model from a training distribution that generalises well on test samples drawn from the same distribution. While the assumption of identical train and test distributions has given rise to useful methods, it is violated in many practical settings (Kouw & Loog, 2018). The label shift problem is one such important setting, wherein the training distribution over the labels does not reflect what is observed during testing (Saerens et al., 2002). For example, consider the problem of object detection in self-driving cars: a model trained in one city may see a vastly different distribution of pedestrians and cars when deployed in a different city.

Such shifts in label distribution can significantly degrade model performance. As a concrete example, consider the performance of a ResNet-50 model on ImageNet: while the overall error rate is ∼24%, Figure 1 reveals that certain classes suffer an error as high as ∼80%. Consequently, a label shift that increases the prevalence of the more erroneous classes in the test set can significantly degrade performance.

Most existing work on label shift operates in the setting where one has an unlabelled test sample that can be used to estimate the shifted label probabilities (du Plessis & Sugiyama, 2014; Lipton et al., 2018; Azizzadenesheli et al., 2019). Subsequently, one can retrain a classifier using these estimated probabilities in place of the training label probabilities. While such techniques have proven effective, it is not always feasible to access an unlabelled set. Further, one may wish to deploy a learned model in multiple test environments, each of which has its own label distribution; for example, the label distribution for a vehicle detection camera may change continuously while driving across a city. Instead of simply deploying a separate model for each scenario, deploying a single model that is

