INTRODUCING SAMPLE ROBUSTNESS

Abstract

Choosing the right data and model for a pre-defined task is a critical competency in machine learning. Investigating which features of a dataset and its underlying distribution a model decodes may shed light on the mysterious "black box" and guide us to a deeper understanding of the ongoing processes. Furthermore, it will help to improve the quality of models which directly depend on data or learn from it through training. In this work, we introduce the dataset-dependent concept of sample robustness, which is based on a pointwise Lipschitz constant of the label map. For a particular sample, it measures how small a perturbation is required to cause a label change, relative to the magnitude of the resulting change under the label map. We introduce theory to motivate the concept and to analyse the effects of the training and test data having similar robustness distributions. Afterwards, we conduct various experiments using different datasets and (non-)deterministic models. In some cases, we can boost performance by choosing specifically tailored training (sub)sets and hyperparameters depending on the robustness distribution of the test (sub)sets.

1. INTRODUCTION:

In the age of automated machine learning, we shift our focus ever more towards regarding meta-hyperparameters such as model type or training and validation budgets as variables of a loss function in the most abstract sense. For training sets, however, the mere number of samples often determines how well suited they are perceived to be for a particular task. The motivation of this work is to introduce a concept that allows datasets to serve as variables of such a generalised loss function as well. Imagine, for example, patients who share almost identical medical records but react differently to some prescribed treatment. They may pose a challenge to machine learning models similar to what is known as natural adversarial examples in vision tasks, see Hendrycks et al. (2019). The relationship between medical features and appropriate personal treatments may be sensitive towards small input variations, i.e. not robust towards perturbations. In this work, we are, to the best of our knowledge, the first to introduce and analyse a model-agnostic measure for the robustness of data. We show that knowledge about the robustness distribution of a specific test (sub)set can allow for choosing a more appropriate training (sub)set in terms of performance optimisation. Finally, we discover that the optimal choice of hyperparameters may also depend on the robustness distributions of both training and test data.

Let us first motivate the concept of sample robustness at a high level. When collecting and processing a dataset for a pre-defined task, we identify certain features and expressive samples such that a model may be able to abstract and generalise from these finite points to the whole space of possible inputs. Assume we have a certain rectangle-shaped data distribution in a circle-shaped feature space and a dataset labelled according to two distinct ground truth maps y* ∈ {×, ∗} and z* ∈ {•, } (cf. Figure 1).
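The idea of tailoring a training (sub)set to the robustness distribution of a test (sub)set can be sketched as follows. This is a hypothetical illustration, not the paper's actual selection procedure: the function name, the greedy quantile-matching strategy, and the assumption that per-sample robustness values are already available are all ours.

```python
import numpy as np

def match_robustness_distribution(train_rob, test_rob, k):
    """Greedily pick k training indices whose robustness values
    approximate the quantiles of the test robustness distribution.

    train_rob, test_rob : per-sample robustness values (1-D sequences)
    k                   : desired training-subset size
    """
    train_rob = np.asarray(train_rob, dtype=float)
    # Target robustness values: k evenly spaced quantiles of the test set.
    targets = np.quantile(np.asarray(test_rob, dtype=float),
                          np.linspace(0.0, 1.0, k))
    chosen = []
    available = np.ones(len(train_rob), dtype=bool)
    for t in targets:
        # Closest still-available training sample to the target quantile.
        idx = int(np.argmin(np.where(available,
                                     np.abs(train_rob - t),
                                     np.inf)))
        chosen.append(idx)
        available[idx] = False
    return chosen
```

For instance, with training robustness values `[0, 1, 2, 3, 4]` and a test set whose robustness values are concentrated at the extremes `[0, 4]`, requesting `k = 2` selects the training samples at indices 0 and 4, mirroring the test distribution.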
Here, one can imagine classifying images of horses and cats (assuming ground truth y*) and classifying images of animal-human pairs (assuming ground truth z*). Evidently, the distance between differently labelled samples depends on the ground truth map labelling them. For every sample in a dataset, the intrinsic information of closeness to a differently labelled sample can be considered a feature itself. For regression tasks and label distributions which are not necessarily categorical, one may also include the distance of the corresponding labels as additional information. By taking the quotient of these two and maximising it over the dataset, i.e. calculating a pointwise Lipschitz constant of the label map, one can measure how sensitive a sample is to label-changing perturbations.

1.1 OUTLINE:

After citing and discussing related work concerned with decision boundaries, model robustness and Lipschitz calculus in section 2, we introduce the mathematical framework and the measure of sample robustness in section 3. We also motivate the concept theoretically and show some natural relations to K-Nearest Neighbour models. Section 4 is devoted to the evaluation using different datasets (CIFAR-10, Online News Popularity Dataset) and models (Convolutional Neural Networks, K-Nearest Neighbour, Random Forest). Section 5 concludes with our findings and discusses further research directions. Letters A–F refer to sections in the appendix.
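The pointwise Lipschitz quotient motivated above can be computed in a few lines. The sketch below is our own minimal illustration, assuming Euclidean distance in feature space and absolute difference between (scalar) labels; the paper's formal definition follows in section 3.

```python
import numpy as np

def pointwise_lipschitz(X, y):
    """Pointwise Lipschitz constant of the label map over a dataset.

    For each sample i, take the maximum over all j != i of
        d(y_i, y_j) / d(x_i, x_j),
    i.e. the quotient of label distance and feature distance.
    A large constant means a differently labelled sample lies nearby,
    so the sample is sensitive to label-changing perturbations.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    L = np.zeros(n)
    for i in range(n):
        dx = np.linalg.norm(X - X[i], axis=1)  # feature-space distances
        dy = np.abs(y - y[i])                  # label distances
        mask = dx > 0                          # exclude coincident points
        L[i] = np.max(dy[mask] / dx[mask])
    return L
```

Under this convention a sample's robustness shrinks as its constant grows: a point sitting close to a differently labelled point attains a large quotient, while a point far from every opposing label attains a small one.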

2. RELATED WORK:

Analysing the data distribution before training yields a way to investigate (and boost) model performance from an earlier stage, as is done by unsupervised pre-training (Erhan et al., 2010). Many algorithms stemming from the unsupervised setting (Chigirev & Bialek, 2003; Cayton, 2005) are devoted to extracting information about the so-called data manifold (Fefferman et al., 2013; Bernstein & Kuleshov, 2014). Decoding the latent features which determine the data distribution (Bengio et al., 2013) provides valuable insight and helps to understand the decision boundaries which a model learns throughout the training phase. Furthermore, understanding the data manifold may provide a view into the "black box" transforming inputs to outputs (Fawzi et al., 2016; Liu et al., 2017; Biggio et al., 2013). In this work, we want to use the intrinsic information of distance between samples in feature space and relate it to the distance of the corresponding labels to introduce a new dataset-dependent feature. The robustness of a sample can be regarded as its susceptibility to label-changing perturbations. Here, one is immediately reminded of adversarial examples (Szegedy et al., 2013) in the context of model robustness. The difference to our proposed concept, however, is that we only use the pre-defined labels instead of model predictions as additional input. Robustness itself is one of the most prominent terms throughout the recent literature, appearing in many different contexts, from robust attacks to robust models/defences (Evtimov et al., 2017; Beggel et al., 2019; Madry et al., 2017; Weng et al., 2018; Tsuzuku et al., 2018). State-of-the-art machine learning models are susceptible to noise, especially so when it is crafted purposefully (Fawzi et al., 2016). This leaves these powerful machines vulnerable to attacks either at training stage (Zhu et al., 2019) or at test stage (Biggio et al., 2013), independent of the architecture used (Papernot et al., 2016b).
We follow the idea that models are extensions of the label map from the (metric) subspace defined by a dataset to the whole feature space. Hence, they will inherit critical properties from the data. In this work we analyse the robustness distribution of datasets and the thereon dependent performance of models, but plan on investigating the connection to model robustness in the future. Lipschitz calculus yields a mathematically well-understood approach to describe and measure model robustness, as in Weng et al. (2018) or Tsuzuku et al. (2018). Framing machine learning theory in terms of metric spaces (and also building robustness concepts thereon) has been done before in Wang et al. (2016), however, without explicitly connecting it to Lipschitz calculus. In this work, we



Figure 1: Images from the CIFAR-10 dataset labelled "horse" and "cat", respectively.

