SHIFTS 2.0: EXTENDING THE DATASET OF REAL DISTRIBUTIONAL SHIFTS

Abstract

Distributional shift, the mismatch between training and deployment data, is a significant obstacle to the use of machine learning in high-stakes industrial applications such as autonomous driving and medicine. This creates a need to assess how robustly ML models generalize, as well as the quality of their uncertainty estimates. Standard ML datasets do not allow these properties to be assessed, as the training, validation, and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared that feature both distributionally matched and shifted data. Among these, the Shifts dataset stands out in terms of the diversity of tasks and data modalities it covers. Unlike most benchmarks, which are dominated by 2D image data, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables models to be assessed on a diverse set of industrial-scale tasks, supporting conclusions that are either universal or directly applicable to a specific task. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmenting white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and estimating power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and strict safety requirements due to the high cost of errors. These new datasets will allow researchers to explore robust generalization and uncertainty estimation in new settings. This work provides a description of the dataset and baseline results for both tasks.

1. INTRODUCTION

In machine learning it is commonly assumed that training, validation, and test data are independent and identically distributed, implying that good test performance is a strong predictor of model performance in deployment. This assumption seldom holds in real, "in the wild" applications. Real-world data are subject to a wide range of possible distributional shifts: mismatches between the training data and the test or deployment data (Quiñonero-Candela, 2009; Koh et al., 2020; Malinin et al., 2021). In general, the greater the degree of shift in the data, the poorer the model's performance on it. While most ML practitioners have faced the issue of mismatched training and test data at some point, the issue is especially acute in high-risk industrial applications such as finance, medicine, and autonomous vehicles, where a mistake by the ML system may incur significant financial, reputational, and/or human loss. Ideally, ML models should demonstrate robust generalization under a broad range of distributional shifts. However, it is impossible to be robust to all forms of shift due to the no free lunch theorem (Murphy, 2012). ML models should therefore indicate when they fail to generalize via uncertainty estimates, which enables actions that improve the safety of the ML system, such as deferring to human judgement (Amodei et al., 2016; Malinin, 2019), deploying active learning (Settles, 2009; Kirsch et al., 2019), or propagating the uncertainty through an ML pipeline (Nair et al., 2020). Unfortunately, standard benchmarks, which contain i.i.d. training

