SHIFTS 2.0: EXTENDING THE DATASET OF REAL DISTRIBUTIONAL SHIFTS

Abstract

Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the use of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. The Shifts dataset stands out in terms of the diversity of tasks and data modalities it features. Unlike most benchmarks, which are dominated by 2D image data, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables models to be assessed on a diverse set of industrial-scale tasks, and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and strict safety requirements due to the high cost of errors. These new datasets will allow researchers to explore robust generalization and uncertainty estimation in new situations. This work provides a description of the datasets and baseline results for both tasks.

1. INTRODUCTION

In machine learning it is commonly assumed that training, validation, and test data are independent and identically distributed, implying that good testing performance is a strong predictor of model performance in deployment. This assumption seldom holds in real, "in the wild" applications. Real-world data are subject to a wide range of possible distributional shifts, i.e., mismatches between the training data and the test or deployment data (Quiñonero-Candela, 2009; Koh et al., 2020; Malinin et al., 2021). In general, the greater the degree of shift, the poorer the model's performance on the shifted data. While most ML practitioners have faced the issue of mismatched training and test data at some point, the issue is especially acute in high-risk industrial applications such as finance, medicine, and autonomous vehicles, where a mistake by the ML system may incur significant financial, reputational and/or human loss. Ideally, ML models should demonstrate robust generalization under a broad range of distributional shifts. However, it is impossible to be robust to all forms of shift due to the no free lunch theorem (Murphy, 2012). ML models should therefore indicate when they fail to generalize via uncertainty estimates, which enable actions to improve the safety of the ML system: deferring to human judgement (Amodei et al., 2016; Malinin, 2019), deploying active learning (Settles, 2009; Kirsch et al., 2019) or propagating the uncertainty through an ML pipeline (Nair et al., 2020). Unfortunately, standard benchmarks, which contain i.i.d. training and held-out data, do not allow robustness to distributional shift or uncertainty quality to be assessed. Thus, there is an acute need for dedicated benchmarks designed to assess both properties. Recently, several such benchmarks for assessing generalisation under distributional shift and uncertainty estimation have appeared.
Specifically, these include ImageNet and its associated A, R, C, and O variants (Hendrycks et al., 2020; Hendrycks & Dietterich, 2019; Hendrycks et al., 2021), the WILDS collection of datasets (Koh et al., 2020), the Diabetic Retinopathy dataset (Filos et al., 2019) and the ML uncertainty benchmarks (Nado et al., 2021). The Shifts Dataset (Malinin et al., 2021) is also a recent benchmark for jointly assessing the robustness of generalisation and uncertainty quality. It is a large, industrially sourced dataset with examples of real distributional shifts, spanning three very different data modalities and four predictive tasks: a tabular weather forecasting task (classification and regression), a text-based translation task (discrete autoregressive prediction) and a vehicle motion-prediction task (continuous autoregressive prediction). The principal difference between Shifts and the other benchmarks is that it was specifically constructed to examine data modalities and predictive tasks other than 2D image classification, the task-modality combination that dominates the benchmarks described above. This work extends the Shifts Dataset with two new datasets sourced from high-risk healthcare and industrial tasks of great societal importance: segmentation of white matter Multiple Sclerosis (MS) lesions in Magnetic Resonance (MR) brain scans and the estimation of power consumption by marine cargo vessels. These datasets constitute distinct examples of data modalities and predictive tasks that are still scarce in the field: the former is a structured prediction task on 3D imaging data, which is novel to Shifts, and the latter a tabular regression task. Both tasks feature ubiquitous real-world distributional shifts and strict requirements for robustness and reliability due to the high cost of errors. For both datasets we assess ensemble-based baselines in terms of the robustness of generalisation and uncertainty quality.
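To make the ensemble-based approach to uncertainty concrete, the following is a minimal, illustrative sketch for a tabular regression setting: an ensemble of linear regressors is fit on bootstrap resamples, the ensemble mean serves as the prediction, and member disagreement serves as the uncertainty estimate. The data, model class and ensemble size here are toy choices for illustration, not the baselines actually evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular regression data (a hypothetical stand-in for the real tasks).
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Train an ensemble of linear regressors, each on a bootstrap resample.
members = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    members.append(w)

# At test time, the ensemble mean is the prediction and the spread of
# member predictions is the per-sample uncertainty estimate.
X_test = rng.normal(size=(50, 3))
preds = np.stack([X_test @ w for w in members])  # shape (members, samples)
mean_pred = preds.mean(axis=0)
uncertainty = preds.std(axis=0)
```

In deployment, samples with large `uncertainty` would be the first candidates for deferral to a human or another risk-mitigating action.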

2. BENCHMARK PARADIGM, EVALUATION AND CHOICE OF BASELINES

Paradigm The Shifts benchmark views robustness and uncertainty estimation as having equal importance. Models should be robust to as broad a range of distributional shifts as possible. However, by the no free lunch theorem, we know that we cannot construct models which are guaranteed to be universally robust to all plausible shifts. Thus, where models fail to robustly generalise, they should yield high estimates of uncertainty, enabling risk-mitigating actions to be taken. Robustness and uncertainty estimation thus become two sides of the same coin and need to be assessed jointly. The Shifts Dataset was originally constructed with the following attributes. First, the data is structured to have a 'canonical partitioning': in-domain, or 'matched', training, development and evaluation datasets, as well as shifted development and evaluation datasets, where the latter two are also shifted relative to each other. Models are assessed on the joint in-domain and shifted development or evaluation data. This is because a model may be robust to certain examples of distributional shift, yielding accurate, low-uncertainty predictions on them, while performing poorly and yielding high estimates of uncertainty on some data matched to the training set. Providing a dataset which contains both matched and shifted data enables this scenario to be evaluated properly. Second, it is assumed that at training or test time it is not known a priori whether or how the data is shifted. This emulates real-world deployments, in which the variation of conditions cannot be sufficiently covered with data, and is a more challenging scenario than one in which information about the nature of the shift is available (Koh et al., 2020). In this work we maintain both attributes.

Evaluation Robustness and uncertainty quality are jointly assessed via error-retention curves (Malinin, 2019; Lakshminarayanan et al., 2017; Malinin et al., 2021).
Given an error metric, error-retention curves depict the error over a dataset as a model's predictions are replaced by ground-truth labels in order of decreasing uncertainty. The area under this curve can be decreased either by improving the predictive performance of the model, such that it has lower overall error, or by providing uncertainty estimates which are better correlated with the errors.
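The computation behind the area under an error-retention curve can be sketched in a few lines of NumPy. This is a minimal sketch, not the benchmark's reference implementation; the function name and the uniform retention grid are our own choices. At retention fraction r, the (1 - r) most uncertain predictions are replaced by ground truth, so their error drops to zero, and the mean error over the whole set is recorded; integrating over all r gives the area.

```python
import numpy as np

def error_retention_auc(errors, uncertainties):
    """Area under the error-retention curve (smaller is better).

    errors: per-sample error values (e.g. squared error per sample).
    uncertainties: per-sample uncertainty estimates, same length.
    """
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    # Sort from least to most uncertain; the cumulative sum of the i
    # most confident samples' errors is the total error retained when
    # the remaining (n - i) predictions are replaced by ground truth.
    order = np.argsort(uncertainties)
    cum_err = np.concatenate([[0.0], np.cumsum(errors[order])])
    retained = cum_err / n  # mean error at each retention level
    # Trapezoidal integration over the uniform retention grid 0..1.
    return float(np.sum((retained[1:] + retained[:-1]) / 2) / n)
```

With an "oracle" uncertainty equal to the per-sample error, the most erroneous predictions are replaced first and the area is minimal; uncertainties that are poorly correlated with the errors yield a larger area, which is exactly the joint robustness/uncertainty trade-off the benchmark measures.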

