INTRODUCING SAMPLE ROBUSTNESS

Abstract

Choosing the right data and model for a pre-defined task is a critical competency in machine learning. Investigating which features of a dataset and its underlying distribution a model decodes may illuminate the mysterious "black box" and lead to a deeper understanding of the ongoing processes. Furthermore, it helps to improve the quality of models that directly depend on data or learn from it through training. In this work, we introduce the dataset-dependent concept of sample robustness, which is based on a point-wise Lipschitz constant of the label map. For a particular sample, it measures how small a perturbation must be to cause a label change, relative to the magnitude of the label map. We introduce theory to motivate the concept and to analyse the effects of similar robustness distributions for the training and test data. Afterwards, we conduct various experiments using different datasets and (non-)deterministic models. In some cases, we can boost performance by choosing specifically tailored training(sub)sets and hyperparameters depending on the robustness distribution of the test(sub)sets.

1. INTRODUCTION:

In the age of automated machine learning, we shift our focus ever more towards regarding meta-hyperparameters such as model type or training and validation budget as variables of a loss function in the most abstract sense. For training sets, however, the mere number of samples often determines how well suited they are perceived to be for a particular task. The motivation of this work is to introduce a concept that allows datasets to serve as variables of such a generalised loss function as well. Imagine, for example, patients who share almost identical medical records but react differently to some prescribed treatment. They may pose a challenge to machine learning models similar to what is known as natural adversarial examples in vision tasks, see Hendrycks et al. (2019). The relationship between medical features and appropriate personal treatments may be sensitive to small input variations, i.e. not robust towards perturbations. In this work, we are, to the best of our knowledge, the first to introduce and analyse a model-agnostic measure for the robustness of data. We show that knowledge about the robustness distribution of a specific test(sub)set can allow for choosing a more appropriate training(sub)set in terms of performance optimisation. Finally, we discover that the optimal choice of hyperparameters may also depend on the robustness distributions of both training and test data. Let us first motivate the concept of sample robustness on a high level. When collecting and processing a dataset for a pre-defined task, we identify certain features and expressive samples such that a model may be able to abstract and generalise from these finite points to the whole space of possible inputs. Assume we have a certain rectangle-shaped data distribution in a circle-shaped feature space and a dataset labelled according to two distinct ground truth maps y* and z* (comp. Figure 1).
Here, one can imagine classifying images of horses and cats (assuming ground truth y*) and classifying images of animal-human pairs (assuming ground truth z*). Evidently, the distance between differently labelled samples depends on the ground truth map labelling them. For every sample in a dataset, the intrinsic information of closeness to a differently labelled sample can be considered a feature itself. For regression tasks and label distributions which are not necessarily categorical, one may also include the distance of the corresponding labels as additional information. By taking the quotient of these two and maximising it over the dataset, i.e. calculating a point-wise Lipschitz constant of the label map, one can measure how sensitive a sample is to label-changing perturbations.

1.1 OUTLINE

After citing and discussing related work concerned with decision boundaries, model robustness and Lipschitz calculus in section 2, we introduce the mathematical framework and the measure of sample robustness in section 3. We also motivate the concept theoretically and show some natural relations to K-Nearest Neighbour models. Section 4 is devoted entirely to the evaluation using different datasets (CIFAR-10, Online News Popularity Dataset) and models (Convolutional Neural Networks, K-Nearest Neighbour, Random Forest). Section 5 finally concludes the findings and discusses further research approaches. Letters A-F refer to sections in the appendix.

2. RELATED WORK:

Analysing the data distribution before training yields a way to investigate (and boost) model performance from an earlier stage, as is done by unsupervised pre-training (Erhan et al., 2010). Many algorithms stemming from the unsupervised setting (Chigirev & Bialek, 2003; Cayton, 2005) are devoted to extracting information about the so-called data manifold (Fefferman et al., 2013; Bernstein & Kuleshov, 2014). Decoding the latent features which determine the data distribution (Bengio et al., 2013) provides valuable insight and helps to understand the decision boundaries which a model learns throughout the training phase. Furthermore, understanding the data manifold may provide a view into the "black box" transforming inputs to outputs (Fawzi et al., 2016; Liu et al., 2017; Biggio et al., 2013). In this work, we use the intrinsic information of distance between samples in feature space and relate it to the distance of the corresponding labels in order to introduce a new dataset-dependent feature. The robustness of a sample can be regarded as its susceptibility to label-changing perturbations. Here, one is immediately reminded of adversarial examples (Szegedy et al., 2013) in the context of model robustness. The difference to our proposed concept, however, is that we only use the pre-defined labels instead of model predictions as additional input. The term robustness itself is among the most prominent throughout the recent literature in many different contexts, from robust attacks to robust models/defences (Evtimov et al., 2017; Beggel et al., 2019; Madry et al., 2017; Weng et al., 2018; Tsuzuku et al., 2018). State-of-the-art machine learning models are susceptible to noise, especially when it is crafted purposefully (Fawzi et al., 2016). This leaves these powerful machines vulnerable to attacks either at training (Zhu et al., 2019) or at test stage (Biggio et al., 2013), independent of the architecture used (Papernot et al., 2016b).
We follow the idea that models are extensions of the label map from the (metric) subspace defined by a dataset to the whole feature space. Hence, they will inherit critical properties from the data. In this work we analyse the robustness distribution of datasets and the model performances that depend on it, but plan to investigate the connection to model robustness in the future. Lipschitz calculus yields a mathematically well-understood approach to describe and measure model robustness, as in Weng et al. (2018) or Tsuzuku et al. (2018). Framing machine learning theory in terms of metric spaces (and also building robustness concepts thereon) has been done before in Wang et al. (2016), however, without explicitly connecting it to Lipschitz calculus. In this work, we build the concept of sample robustness on a point-wise Lipschitz constant of the label map for metric and Banach spaces, such that it applies to a wide range of different feature and label spaces, including an ample variety of metrics. Finally, the dependence of model hyperparameters on data has been investigated previously (Nakkiran et al., 2020). The authors also showed that more data can sometimes decrease model performance. In this work, we will see similar results regarding these two aspects, with the difference that we can identify such data even before training using the proposed measure of sample robustness.

3. SAMPLE ROBUSTNESS

Now we will introduce the primary concept of this work, namely sample robustness. It measures how sensitive samples are to label-changing perturbations relative to the magnitude of the label map.

3.1. DEFINITION: (FRAMEWORK)

Let FS be a feature space, i.e. a metric space with metric d := d_FS; let TS be a target or label space, i.e. a real Banach space with norm ‖·‖ := ‖·‖_TS; let x, t ⊆ FS be finite (data)sets with #x, #t ≥ 2; and let y : x ∪ t → TS be a map of labels with #y(x) = #y(t) ≥ 2.

3.2. DEFINITION: (REACH)

Let x ∈ FS be a sample with label y(x), ‖y(x)‖ ≤ ‖y‖_∞, and let Γ_x(x) := {x' ∈ x | y(x') ≠ y(x)}. The reach of x is defined as the distance of x to Γ_x(x):

r_x(x) := dist(x, Γ_x(x)) = min_{x' ∈ Γ_x(x)} d(x, x')

In other words: Γ_x(x) is the set of samples x' in the dataset x that are labelled differently from x. One may notice that x' ∈ Γ_x(x) ⇔ x ∈ Γ_x(x'). The reach of x ∈ x is exactly the minimal distance of the point representing the image of a "cat" to a differently labelled (coloured) point in Figure 1. If x' is "close" to x, one would expect their labels y(x') and y(x) to be "close" as well. Taking the quotient then gives a measure of this error (comp. 3.3), and normalising the latter with respect to the magnitude of y (using ‖y‖_∞ := max_{x ∈ x} ‖y(x)‖) yields the main concept of sample robustness in 3.4.

3.3. DEFINITION: (POINT-WISE LIPSCHITZ CONSTANT)

For x ∈ FS with label y(x), ‖y(x)‖ ≤ ‖y‖_∞, and Γ_x(x) as in 3.2, one defines a point-wise Lipschitz constant of y as:

Q_x(x) := max_{x' ∈ Γ_x(x)} ‖y(x') − y(x)‖ / d(x, x')

3.4. DEFINITION AND PROPOSITION: (SAMPLE ROBUSTNESS)

Let x ∈ FS with label y(x), ‖y(x)‖ ≤ ‖y‖_∞, and Γ_x(x) as in 3.2. The robustness of the sample x in x with respect to d is defined as:

R_x(x) := ‖y‖_∞ / (Q_x(x) + ‖y‖_∞) ∈ (0, 1)

It is invariant under rescaling of the label map y (comp. A.2). Coming back to the example in the introduction, we can now see that the almost identical medical records of patients reacting differently to the same treatment are considered less robust samples in the above sense.
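Definitions 3.2-3.4 can be computed directly from pairwise distances. Below is a minimal sketch assuming a Euclidean feature space and array-valued labels; the function and variable names are our own, not from the paper:

```python
import numpy as np

def sample_robustness(X, y):
    """For every sample, find the differently labelled set Gamma_x(x_i),
    the point-wise Lipschitz constant Q_x(x_i), and finally
    R_x(x_i) = ||y||_inf / (Q_x(x_i) + ||y||_inf)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    y_inf = np.linalg.norm(y, axis=1).max()            # ||y||_inf over the dataset
    R = np.empty(len(X))
    for i in range(len(X)):
        gamma = ~np.all(y == y[i], axis=1)             # Gamma_x(x_i): different labels
        d = np.linalg.norm(X[gamma] - X[i], axis=1)    # distances in feature space
        Q = (np.linalg.norm(y[gamma] - y[i], axis=1) / d).max()  # Q_x(x_i)
        R[i] = y_inf / (Q + y_inf)                     # R_x(x_i) in (0, 1)
    return R
```

For a binary label map this reduces to r/(r + 1), with r the reach (comp. Proposition A.3).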

3.5. THEORETIC MOTIVATION AND BACKGROUND

Assume we have datasets x and t, both labelled using the same label map y with max_{x ∈ x} ‖y(x)‖ = max_{t ∈ t} ‖y(t)‖. For any z ∈ x ∪ t it holds that:

Q_{x∪t}(z) = max{Q_x(z), Q_t(z)} ⇔ R_{x∪t}(z) = min{R_x(z), R_t(z)}

In other words: the closer R_x(z) is to R_t(z), the closer both values are to the robustness of z in the union x ∪ t. It follows that R_x(z) ≈ R_t(z) ⇒ R_x(z) ≈ R_{x∪t}(z) ≈ R_t(z), where at least one side is an equality. For convenience, we write x ∼_R t :⇔ R_x(z) ≈ R_t(z) ∀ z ∈ x ∪ t. Assume now that F is an extension of the label map y from x to x ∪ t. For given t one can downsize the set x to a subset of x in order to align both robustness distributions; however, there will likely be a trade-off between this alignment and the distance of F to y as maps on t, because there are fewer points to extend from (therefore allowing for a higher variance). For such an extension F it holds that F|_x ≡ y|_x, thus:

Q_x(x) = max_{x' ∈ Γ_x(x)} ‖y(x') − y(x)‖ / d(x, x') = max_{x' ∈ Γ_x(x)} ‖F(x') − F(x)‖ / d(x, x') ∀ x ∈ x

Assuming x ∼_R t then enables the following conclusion for z ∈ x ∪ t:

(∗) Q_{x∪t}(z) ≈ Q_x(z) = max_{z' ∈ Γ_x(z)} ‖y(z') − y(z)‖ / d(z, z') = max_{z' ∈ Γ_x(z)} ‖F(z') − F(z) − ε_z‖ / d(z, z'),

where ε_z := y(z) − F(z). Notably, the right-hand side depends on at most one point outside x. Therefore it includes at most one ε_z, compared to the naive approach including both ε_z and ε_{z'}:

Q_{x∪t}(z) = max_{z' ∈ Γ_{x∪t}(z)} ‖y(z') − y(z)‖ / d(z, z') = max_{z' ∈ Γ_{x∪t}(z)} ‖F(z') − F(z) + (ε_{z'} − ε_z)‖ / d(z, z')

To summarise: by assuming x ∼_R t one can find a small γ_z such that Q_{x∪t}(z) = Q_x(z) + γ_z, trading ε_z ∈ TS for γ_z ∈ ℝ. Whereas the first depends on the extension F, the latter only depends on the data. Let now L(F − y) be the Lipschitz constant of the map F − y on x ∪ t. Using (∗) and the reverse triangle inequality one can derive the following (comp.
A.4):

L(F − y) ≥ max_{z ∈ x∪t} [ max_{z' ∈ Γ_{x∪t}(z)} ‖y(z') − y(z)‖ / d(z, z') − max_{z' ∈ Γ_{x∪t}(z)} ‖F(z') − F(z)‖ / d(z, z') ] = max_{z ∈ x∪t} [ max_{z' ∈ Γ_x(z)} ‖F(z') − F(z) − ε_z‖ / d(z, z') + γ_z − max_{z' ∈ Γ_{x∪t}(z)} ‖F(z') − F(z)‖ / d(z, z') ]

Hence, ε_z and γ_z determine a lower bound on L(F − y), and decreasing it a priori may allow for finding an extension F that minimises both ‖F − y‖_∞ and L(F − y) at the same time. By aligning the robustness distributions of x and t, this bound will not only depend on the extension F, but on the data (trading the possibly uncontrollable for the controllable). Finally, the true motive for such an approach stems from functional analysis: the space of Lipschitz functions from x ∪ t to TS, i.e. Lip(x ∪ t, TS), is a Banach space with respect to the norm ‖·‖_sum := ‖·‖_∞ + L(·) (comp. Cobzaş et al. (2019)). So by regarding extensions F of y that minimise ‖F − y‖_sum, we a priori restrict the size of the hypothesis space from arbitrary maps to Lipschitz maps. The completeness of Lip(x ∪ t, TS) is of importance, as it prevents sequences of extensions (F_n) with ‖F_n − y‖_sum → 0 from leaving this smaller space.

3.6. SAMPLE ROBUSTNESS AND KNN

Let F(z) := Σ_{i=1}^K ω_i y(z_i) be a K-nearest-neighbour model with reference set x, where z_i is the i-th nearest neighbour of z in x with weight ω_i. The formula (∗) from 3.5 translates to:

max_{z' ∈ Γ_{x∪t}(z)} ‖y(z') − y(z)‖ / d(z, z') ≈ max_{z' ∈ Γ_x(z)} ‖Σ_{i=1}^K ω_i y(z_i) − y(z') + ε_z‖ / d(z, z')

Hence, by assuming x ∼_R t we have imposed a constraint on the (deterministic) model. More precisely, it is forced to base its prediction for z on those K samples z_i close to z for which the weighted linear combination of their labels satisfies the above formula a priori. If z is among the most robust samples, we know that a small change in feature space will only cause a small change in label space. Therefore we expect that a higher K produces a higher accuracy, as samples close to z will be more reliable predictors. Conversely, this is not the case for less robust samples, as can be seen by comparing prediction and true label:

F(z) − y(z) = Σ_{i=1}^K ω_i y(z_i) − y(z)

If there exists a z* close to z such that the quotient ‖y(z*) − y(z)‖ / d(z*, z) is large, then by increasing K we are more likely to find a z_i close or equal to z*, causing the difference to grow. Another interesting fact is that less robust samples can by construction be considered natural adversarial examples for a KNN model (Hendrycks et al., 2019). Assume for simplicity a classification task with K = 1: if x and t show similar robustness distributions and x ∈ x is among the less robust samples, then there likely exists a sample t_a ∈ t such that d(x, t_a) is small and y(x) ≠ y(t_a) (the true labels). However, x and t_a being close may cause the prediction y(arg min_{z ∈ x} d(t_a, z)) to be equal to y(x).
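The K = 1 argument can be made concrete in a few lines. The following toy illustration uses made-up numbers, not data from the paper:

```python
import numpy as np

# A less robust training sample x = 0.0 (label 0) lies close to a test
# sample t_a = 0.05 whose true label is 1: the pair acts as a natural
# adversarial example for a 1-NN classifier, comp. 3.6.
X_train = np.array([[0.0], [1.0]])   # reference set x
y_train = np.array([0, 1])           # binary labels
t_a = np.array([0.05])               # true label y(t_a) = 1

nearest = np.argmin(np.linalg.norm(X_train - t_a, axis=1))
prediction = y_train[nearest]        # equals y(x) = 0, not the true label 1
```

The 1-NN prediction for t_a inherits the label of the nearby training sample x, contradicting the true label of t_a.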

4.1. EVALUATION MODEL

We now determine the robustness distribution of the training (x) and test data (t) for both CIFAR-10 and the Online News Popularity Dataset (short: ONP) to cover the classification and regression setup. In each case, we identify subsets of x and t stemming from the extremes of the respective robustness distribution, hoping to amplify any possible effects. Then we use deterministic models (K-Nearest Neighbour =: KNN) and non-deterministic models (Convolutional Neural Networks =: CNNs, Random Forest =: RF) to analyse performance in terms of the different subsets of x and t. For this purpose, we measure accuracy and loss, which indicate how much and how well a model can generalise. "Best" performances are displayed in boldface. The algorithms are noted in B, as they are partially based on additional results in A. We also extended the analysis to include the MNIST dataset in F. The CIFAR-10 dataset (Krizhevsky, 2012) consists of 60,000 32×32-pixel RGB images evenly split into the classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Figure 2 provides visual examples. The ONP dataset consists of metadata of over 39,000 articles published by Mashable (Fernandes, 2015), split into 58 predictive attributes (e.g. "number of words", "data channel", "polarity of positive/negative words", ...), 2 non-predictive ones ("URL", "timedelta") and 1 goal field ("shares"). Although the latter is traditionally regarded as a regression dataset, one can associate a pseudo-classification task with it, where we classify popular (≥ 1400 shares, comp. Fernandes (2015)) and unpopular articles based on the prediction of shares. For almost all ONP subsets, the binary labels are equally distributed (the rate is 1:1). For CIFAR-10 we used a KNN classifier and two different CNN architectures, one "small" CNN and the more complex ResNet-56 model (see C.1 for details). For ONP, we used a KNN regressor as well as an RF regressor (see C.2 for details).
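The pseudo-classification accuracy for ONP can be sketched as follows, with the 1400-share threshold from above; the helper name is our own:

```python
import numpy as np

def pseudo_accuracy(shares_true, shares_pred, threshold=1400):
    """ACC for the ONP pseudo-classification task: an article counts as
    popular iff its (true resp. predicted) number of shares is >= 1400."""
    popular_true = np.asarray(shares_true) >= threshold
    popular_pred = np.asarray(shares_pred) >= threshold
    return float(np.mean(popular_true == popular_pred))
```

A regression model is thus scored by how often its predicted share count falls on the same side of the threshold as the true count.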
Neither for CIFAR-10 nor for ONP did we use cross-validation, due to the high computational effort of computing sample robustness values for each newly formed training set x and test set t. One could determine R_{x∪t}(x) for all samples beforehand and then apply cross-validation to make this feasible, but these robustness values would no longer be independent of each other. For the small CNN we first estimated its overfitting threshold for the different training(sub)sets; for the ResNet-56 we measured after how many epochs the validation loss did not decrease further (using callbacks). In both cases, we used t as the validation set (instead of held-out data) to emphasise the impact of the different robustness distributions in an artificial best-performance setup. The RF regressor was trained using different numbers of trees ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9} and both Mean Absolute Error (MAE) and Mean Squared Error (MSE) as loss criteria. The KNN classifier and regressor were constructed for K ∈ {1, 2, 3, ..., 15} and both uniform weights ("uni") and weights defined by Euclidean distance ‖·‖_2 ("dist"). The non-deterministic models were trained 25 times, and we discarded the ten highest and lowest loss and accuracy values. Finally, we averaged the remaining five to avoid focusing on individual performances. The details for each model are in C.3.
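The trimmed averaging of the 25 runs can be sketched as follows (the helper name is ours):

```python
def trimmed_average(values, drop=10):
    """Discard the `drop` highest and `drop` lowest of the collected
    loss/accuracy values and average the rest (25 runs -> 5 values kept)."""
    kept = sorted(values)[drop:len(values) - drop]
    return sum(kept) / len(kept)
```

This deliberately reports a robust central tendency rather than the best (or worst) individual model.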

4.2. CLASSIFICATION CENTRIC ANALYSIS (CIFAR-10)

Consider the feature space FS := [0, 1]^{3·32²} with Euclidean metric ‖·‖_2 and the CIFAR-10 dataset of 50,000 training images x and 10,000 test images t labelled as k ∈ {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}. We regard the label maps as functions y_k : x → ℝ, where y_k(x) is 1 if the label of x is k and 0 otherwise. The target space ℝ is equipped with its Euclidean norm, making it a Banach space. Using Algorithm 2 (comp. B) we determined the robustness of all samples in both x and t, respectively. Afterwards, we collected subsets of the least and most robust samples, the distribution of which is shown in Table 1. The subscript defines the subset size in thousands (number) and whether they belong to the least (L) or most (M) robust samples of x or t. As an example, x_L40 stands for the 40,000 least robust samples of the training set x. The graphs in D.1 compare the label-wise robustness distributions for x and t. The relative label distributions of both x and t are similar judging by the 50%, 20% and 80% quantiles (black percentage values are almost equal, blue and green values add up to about 100%). In Figure 2 some of the most and least robust samples in x are displayed together with their robustness values. Here, the values of the least robust samples of "airplane" and "ship" coincide, as one is the closest differently labelled image with respect to the other (and vice versa). The same holds for the least robust images of "bird" and "deer". In this classification setup, one can see that a close "visual distance" causes low sample robustness, as the difference of labels is either 1 or 0. More examples are in D.2. The results show the effect of training and testing on subsets which are more aligned in terms of their robustness distribution, as discussed in 3.5. Similar results hold for x_M40.
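Selecting subsets such as x_L40 or x_M25 from the robustness values is a simple sort. A sketch (names are ours):

```python
import numpy as np

def robustness_subsets(R, n_least, n_most):
    """Return index arrays of the n_least least robust and the n_most
    most robust samples according to their robustness values R."""
    order = np.argsort(R)                   # ascending: least robust first
    return order[:n_least], order[-n_most:]
```

Applied to the 50,000 training robustness values, `robustness_subsets(R, 40000, 40000)` would yield the indices behind x_L40 and x_M40.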
When removing half the samples from x, we can see the trade-off between creating more similar robustness distributions and the overall ability of the model to generalise from its training data. Table 3 furthermore reveals that training on the more robust half of x causes less of a bias: loss on the test(sub)sets is more evenly distributed compared to that of models trained on x_L25. This effect is likely caused by the model learning to connect small differences in feature space with large differences in label space. Hence, it upscales predictions of more robust samples that are further away from the less robust data it was trained on. For the ResNet-56, we can see the same relation between training and performing on subsets of different robustness as for the small CNN. However, there is no (partial) improvement over the baseline. We attribute this to the different learning structure of a residual neural network, for which the trade-off mentioned above may behave differently.

4.2.2. KNN

For KNN, the highest accuracy in every case was observed for K = 1. Table 5 shows the performance matrix for all data(sub)sets (note that for K = 1 both weights coincide). D.4 displays the accuracy curves for K ∈ {1, ..., 15} and both weights, using x, x_L25 and x_M25 as reference sets. Observations (KNN): the highest accuracy on the most and least robust test(sub)sets is achieved using the reference sets which are the most similar in terms of their robustness distribution, comp. 3.5. Whereas accuracy on t_L2 did not decrease by removing more robust samples, it increased on t_M2 when removing less robust samples. This effect, and the increased accuracy on t by using x_L40, we attribute to the removal of natural adversarial examples as discussed in 3.6.

4.3. REGRESSION CENTRIC ANALYSIS (ONLINE NEWS POPULARITY DATASET)

The Online News Popularity dataset (Fernandes, 2015) was split randomly into training set x and test set t, consisting of 32,000 and 7,644 samples, respectively. Each sample x ∈ x ∪ t is an element of FS := [0, 1]^58, where we use the Euclidean metric ‖·‖_2 again. The label map y : x → ℝ (where ℝ is equipped with its Euclidean norm as well) assigns to each sample a normalised number of shares in [0, 1]. This time we used Algorithm 1 (comp. B) to calculate the robustness for each element in x and t (again independently). We then defined different subsets of both x and t using the same notation as for the CIFAR-10 set. The sets t_L and t_M consist of the least and most robust half of t; the sets t_l and t_m consist of the 500 least and most robust samples of t_L and t_M, respectively. Figure 3 displays the relative distribution of shares for all data(sub)sets. The black dashed line divides samples into those below 10,000 shares on the left side and those above 10,000 shares on the right (cumulated at 110). The values in the middle display the relative amount of samples with shares above this threshold. For most of the data(sub)sets, about 5% of the samples have more than 10,000 shares. For both x_M2 and t_m the relative amount is about 9.4%. Conversely, while 7.6% of the samples in x_L2 have more than 10,000 shares, this holds for only 0.34% of t_l (i.e. less than half the relative amount). Indeed, this implies that the label distribution of the ≈ 6% least robust samples in x is quite different from that in t. E.1 compares the overall robustness distributions of x and t.

4.3.1. KNN

Table 6 shows the performance matrix (LOSS \ ACC) for uniform ("uni") and distance ("dist") weights, as well as the particular choice of K where ACC was highest on t. Table 7 shows the performances and the specific K (in brackets) for which ACC was highest on each test(sub)set, respectively. E.2 displays the performance graphs.

Table 6: "uni" (left) / "dist" (right) weights where ACC was highest on t.
                     x | x_L24 | x_M24
K:                   3 | 4 | 4
-0.1456 \ 0.5900 | -0.1377 \ 0.5959 | -0.3945 \ 0.5907
-0.0869 \ 0.5900 | -0.0940 \ 0.6049 | -0.4290 \ 0.5942
-0.3863 \ 0.5900 | -0.3169 \ 0.5869 | -0.2531 \ 0.5871
 0.0087 \ 0.5900 | -0.0046 \ 0.6140 | -0.1366 \ 0.6020
-0.1255 \ 0.6320 | -0.2217 \ 0.6480 | -0.0544 \ 0.6500

Observations (KNN): we see a large difference in ACC on t_l and t_m, which we attribute to the differing distributions in x and t (comp. x_L2 and t_l in Figure 3). The optimal K for each test(sub)set varies between 3 and 4, except for t_m, where it is nearly 3-4 times as high. Moreover (and in contrast to KNN on CIFAR-10), this causes an increase in accuracy of about 2%-5%, in accordance with our theoretic explanation in 3.6.

4.3.2. RANDOM FOREST

Table 8 shows the performance matrix (LOSS \ ACC) and the number of trees for which ACC on t was highest. Table 9 shows the performances and numbers of trees (in brackets) for which ACC was highest on each test(sub)set, respectively. E.3 displays the performance graphs.

x_M24: -0.1867 \ 0.5951 (5) | -0.2884 \ 0.5907 (4) | -0.1760 \ 0.5920 (3)
t:     -0.2098 \ 0.5973 (3) | -0.1583 \ 0.5964 (4) | -0.1006 \ 0.5974 (3)
t_L:   -0.4461 \ 0.5922 (5) | -0.8000 \ 0.5867 (3) | -0.3172 \ 0.5899 (4)
t_M:    0.0065 \ 0.6316 (3) | -0.0198 \ 0.6288 (3) |  0.0199 \ 0.6244 (5)
t_l:   -0.2920 \ 0.6620 (7) | -0.2708 \ 0.6512 (9) | -0.2054 \ 0.6620 (8)
t_m:
                     x | x_L24 | x_M24
-0.1148 \ 0.5953 (7) | -0.2796 \ 0.5909 (6) | -0.1305 \ 0.5928 (4)
-0.0348 \ 0.6006 (9) | -0.1617 \ 0.5963 (4) | -0.0616 \ 0.5968 (4)
-0.5619 \ 0.5925 (4) | -0.9062 \ 0.5873 (6) | -0.2197 \ 0.5895 (7)
-0.0003 \ 0.6352 (6) | -0.0034 \ 0.6356 (7) |  0.0172 \ 0.6288 (5)
-0.3217 \ 0.6692 (8) | -0.6496 \ 0.6520 (9) | -0.2261 \ 0.6672 (8)

Observations (RF): using 75% of the original training data correlates with a slight decrease in performance (independent of the metric), and with a smaller optimal number of trees for x_M24 than for x_L24. When training on x_L24, LOSS on t_M and t_m increases significantly. This is likely caused by the model learning to connect small changes in feature space with large changes in label space. Hence, it tends to overshoot predictions for more robust data, which is also in accordance with our observations in Table 3. As for KNN, we can see two things: (i) a large difference in ACC on t_l and t_m and (ii) the optimal hyperparameter being significantly higher for t_m. The first we attribute again to the different distributions of x_L2 and t_l; for the second we expect a similar explanation as for KNN (comp. 3.6).

5. CONCLUSION AND FUTURE WORK

We introduced the concept of sample robustness for measuring how sensitive elements of a dataset are towards label-changing perturbations. We provided a theoretical motivation and analysed the robustness distribution of different datasets, as well as the connection to model performance. In concordance with our theoretical analysis, we found that it is possible to boost performance on specific test(sub)sets by choosing training(sub)sets exhibiting similar robustness distributions. Empirical results, however, indicate that there is a model-dependent trade-off between discarding samples to align these distributions and the general ability of a model to generalise from its training or reference data. Finally, we found that optimal hyperparameters may also depend on the robustness of both the training and test set. Possible future research directions are: (i) expanding experiments (using cross-validation, more datasets and models, different metrics, etc.); (ii) analysing optimal model hyperparameters as functions h(x_AB, t_AB) in terms of training(sub)sets (x_AB) and test(sub)sets (t_AB) from different parts of the robustness spectrum; (iii) exploring the relation between training on more or less robust data and model robustness, e.g. susceptibility to adversarial examples (Szegedy et al., 2013).

A ADDITIONAL THEORY AND PROOFS:

A.1 PROPOSITION: Let L(y) be the Lipschitz constant of the label map y : x → TS. The following equality holds:

max_{x ∈ x} Q_x(x) = L(y)

Proof. One has L(y) ≥ max_{x ∈ x} Q_x(x) by construction. Furthermore, as x is compact:

L(y) = max_{x ≠ x'} ‖y(x) − y(x')‖ / d(x, x') = ‖y(x_0) − y(x'_0)‖ / d(x_0, x'_0)

for some x_0 ≠ x'_0 such that y(x_0) ≠ y(x'_0). Thus:

‖y(x_0) − y(x'_0)‖ / d(x_0, x'_0) ≤ Q_x(x_0) ≤ max_{x ∈ x} Q_x(x)

A.2 PROPOSITION: R_x(x) is invariant under rescaling of the label map y.

Proof. Suppose we have a label map ỹ := ay for a ∈ ℝ ∖ {0}; then:

R̃_x(x) := ‖ỹ‖_∞ / (max_{x' ∈ Γ_x(x)} ‖ỹ(x') − ỹ(x)‖ / d(x, x') + ‖ỹ‖_∞) = |a| ‖y‖_∞ / (|a| max_{x' ∈ Γ_x(x)} ‖y(x') − y(x)‖ / d(x, x') + |a| ‖y‖_∞) = R_x(x)

A.3 PROPOSITION: Let x ∈ x with reach r_x(x). If y(x') ∈ {0, e} for all x' ∈ x, where e lies on the unit sphere in TS (i.e. y is a binary label map), then:

R_x(x) = r_x(x) / (r_x(x) + 1)

Proof. It holds that:

R_x(x) = (max_{x' ∈ Γ_x(x)} ‖y(x') − y(x)‖ / d(x, x') + 1)^{-1} = (max_{x' ∈ Γ_x(x)} 1 / d(x, x') + 1)^{-1} = (1 / min_{x' ∈ Γ_x(x)} d(x, x') + 1)^{-1} = (1 / r_x(x) + 1)^{-1} = ((1 + r_x(x)) / r_x(x))^{-1} = r_x(x) / (r_x(x) + 1)

Algorithm 2: Calculate R_x(x) (binary y)

Require: y(x) ⊆ {0, e} with ‖e‖ = 1
  y ← y(x)
  r ← max_{x' ∈ x} d(x, x')
  for x_i ∈ x do
    y_i ← y(x_i)
    if y ≠ y_i then
      r ← min{r, d(x, x_i)}
    end if
  end for
  print r / (r + 1)

C MODELS

C.1 CIFAR-10

SMALL CNN

The small CNN architecture was taken from the Keras homepage and consists of two convolutional blocks, each built using two convolutional layers with kernel size (3,3) and ReLU activation function, followed by MaxPooling with pool size (2,2) and a Dropout ratio of 0.25. After this, there are two dense layers (512/ReLU and 10/Softmax) with a Dropout ratio of 0.5 in between. It was compiled using the RMSprop optimizer with a learning rate of 10^-4 and decay. We used categorical cross-entropy as our loss function and trained the model with a batch size of 32.
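The binary case of Algorithm 2 reduces, via Proposition A.3, to computing the reach. A minimal sketch assuming Euclidean distance; the function name is ours:

```python
import numpy as np

def binary_robustness(i, X, y):
    """R_x(x_i) = r / (r + 1) for a binary label map, where r is the reach,
    i.e. the distance from x_i to the nearest differently labelled sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    gamma = y != y[i]                                   # Gamma_x(x_i)
    r = np.linalg.norm(X[gamma] - X[i], axis=1).min()   # reach r_x(x_i)
    return r / (r + 1.0)
```

Unlike the general computation, this needs only one pass over the differently labelled samples per query point.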

RESNET-56

The ResNet-56 architecture (He et al., 2016) was also taken directly from the Keras homepage. We used n = 9, batch sizes of 32 and the pre-defined learning rates of 10^-3 and 10^-4.

KNN

For the KNN classifier we used the pre-defined scikit-learn algorithm with both uniform weights ("uni") and weights defined by the Euclidean metric ‖·‖_2 ("dist"). For the KNN regressor we used the pre-defined scikit-learn algorithm with the same two weight options. The loss function is defined as

LOSS := 1 − Σ_i (y_true,i − y_pred,i)² / Σ_i (y_true,i − MEAN(y_true))² ∈ (−∞, 1],

where 1 is the optimum. Accuracy is based on the pseudo-classification task (see 4.1).
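The LOSS above is exactly the coefficient of determination R². A sketch (the function name is ours):

```python
import numpy as np

def knn_loss(y_true, y_pred):
    """LOSS := 1 - sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2),
    taking values in (-inf, 1] with optimum 1 (perfect prediction)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of y_true scores 0; negative values are arbitrarily bad.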

RANDOM FOREST

To construct a random forest regressor we also used the pre-defined scikit-learn algorithm. We measured the same loss as for KNN. Again, accuracy is based on the pseudo-classification task (see 4.1).

C.3 PROCEDURE DETAILS FOR THE NON-DETERMINISTIC MODELS

SMALL CNN

For the small CNN, we trained multiple models for an increasing number of epochs e until we could roughly pinpoint the overfitting threshold e_over on the test set t for every training(sub)set. Then, in a second session, we trained 25 independent models (always starting with different weights) for e_over epochs and collected their performances on all of the test(sub)sets t, t_L5, t_M5, t_L2, t_M2 in five accuracy and five loss lists, respectively. From each list, we discarded the ten highest and the ten lowest values of the 25 and then averaged the remaining five in order to avoid focusing on individual model performance. After this, we repeated the procedure for e_k := e_over ± 5k, k ∈ ℤ, epochs until we found a local loss minimum on t away from the lowest and highest number of epochs in each case. However, it is still possible to fall victim to the stochastic behaviour of neural networks, and in some cases we repeated second sessions to account for this fact. The optimal number of epochs may vary within a ±5 vicinity. Also, since double descent has been discovered (Nakkiran et al., 2020), one may always question the optimal overfitting threshold.

RESNET-56

The ResNet-56 models were trained 15 times on each of the different training(sub)sets for 80 epochs using a learning rate of 10^-3. Here, we used callbacks to save the best weights and to identify the particular number of epochs for which the loss on t was lowest. We then picked an individual epoch threshold from {5k | k ∈ ℕ} such that at most 3 of the 15 values lay above it (to account for outliers). Finally, we conducted second sessions similar to those for the small CNN, where we (i) trained the models for the number of epochs indicated by these upper thresholds, (ii) saved the best weights, (iii) rebuilt the models with these weights and (iv) topped everything off with an additional epoch of training using a lower learning rate of 10^-4. The epoch history throughout training the 15 models per training(sub)set is shown below. The number in brackets is the rounded average after discarding the three highest and three lowest of the 15 values.

• x: [13, 17, 20, 12, 12, 15, 15, 18, 13, 19, 12, 23, 23, 17, 8] (15)
• x_L40: [9, 17, 14, 15, 12, 18, 13, 16, 11, 12, 64, 7, 13, 12, 9] (13)
• x_M40: [14, 11, 21, 11, 9, 10, 77, 15, 11, 14, 10, 8, 14, 8, 13] (12)
• x_L25: [12, 7, 9, 14, 12, 77, 9, 14, 12, 21, 5, 7, 14, 7, 11] (11)
• x_M25: [6, 15, 9, 13, 13, 9, 7, 10, 15, 11, 13, 17, 10, 10, 9] (11)

RANDOM FOREST

For every number of trees, we simply used the same second-session procedure as for the small CNN, tailored to the ONP training- and test(sub)sets.

Here, the average is determined using the small CNN and the same second-session procedure described in C.3. Notably, the background of each animal or object displayed in an image impacts robustness, as differently coloured surroundings significantly increase distance w.r.t. ‖·‖_2. Comparing the robustness and average model probability for the "bird" and "frog" images, there seems to be no definitive relationship between those values.

We trained a single small CNN (see C.1) on x for 85 epochs (which approximately showed the average performance in Table 2) and monitored its performance throughout the process. The graphs below show the loss and accuracy values for all test(sub)sets. From the beginning, the model performed better on the more robust half of the test set t M5 than on the less robust half t L5. Loss on t L2 is almost always the lowest of all five, though this does not necessarily translate into high accuracy. Indeed, the subset on which the model has the highest accuracy switches between t L2 and t M2 every few epochs throughout the whole training process.

The authors of Papernot et al. (2016a) noticed that when crafting source-target pairs of adversarial examples from the MNIST test set, "classes "0", "2" and "8" are hard to start with, while classes "1", "7" and "9" are easy to start with". Interestingly, comparing the compositions of the 2000 least- and most robust samples, we can see that the three most prominent classes in t L2 were "1", "7" and "9" (closely followed by "4"), while the two most prominent classes in t M2 were "0" and "2" (followed by "6" and "3" before "8").
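The class-composition comparison above amounts to a simple frequency count over the labels of a robustness-ranked subset; a minimal sketch (the helper name and label list are invented for illustration):

```python
from collections import Counter

def most_prominent_classes(labels, k=3):
    """Return the k most frequent class labels in a (sub)set."""
    return [cls for cls, _ in Counter(labels).most_common(k)]

# Hypothetical labels of a least-robust test subset.
t_L2_labels = ["1", "7", "9", "1", "4", "7", "9", "1", "7", "9", "4", "1"]
print(most_prominent_classes(t_L2_labels))  # ['1', '7', '9']
```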
Figure 12 displays the most and least robust samples of each class together with their sample robustness and the average probability of a shallow CNN with an average accuracy of about 98% (25 models were trained independently, and accuracy values were collected similarly to the second session for the small CNN on CIFAR-10). The boldness of a written digit has a beneficial impact on its robustness as it increases Euclidean distance significantly. We can also see that the least robust image of a "3" presents itself as a mislabelled "9". Indeed, this makes sense, as its distance in feature space to an actual image of a "9" in x is relatively small. One may be tempted to regard sample robustness as a tool for anomaly detection (Chalapathy & Chawla, 2019; Beggel et al., 2019), as it rightfully identifies the mislabelled image of a "9". To underline that it is a purely intrinsic and metric-dependent concept, consider the set of "handwritten digits" in Figure 13 and their respective robustness values (note that the image of the "4" is not missing but plain white). The relatively high value of the "3" being a "4" may be caused by the missing normalisation of size and position (as was done for the MNIST set).

Figure 13: Robust samples of "handwritten digits" labelled as the number in brackets. Robustness values: 0.9028 ("0"), 0.9139 ("1"), 0.8923 ("2"), 0.8071 ("3"), 0.7812 ("4"), 0.9441 ("5"), 0.8984 ("6"), 0.9139 ("7"), 0.9152 ("8"), 0.9319 ("9").



One may think of a machine learning model F trained on x making predictions on t; training a machine learning model produces exactly such a sequence F_n. We attribute this to the following fact: a large distance to the closest differently labelled sample does not ensure that there exist close samples of the same label after all. This effect is amplified for image data, where changing the background does not necessarily change the label of a displayed object, but will increase Euclidean distance significantly.

Footnotes: The Keras CIFAR-10 ResNet example (https://keras.io/zh/examples/cifar10 resnet/) was recently removed. Further documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html, https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html



Figure1: Images from the CIFAR-10 dataset labelled "horse" and "cat", respectively.

Figure 2: Most-(left) and least robust samples in x labelled "airplane", "ship", "bird", "deer".

Figure 3: Relative distribution of shares for all training- and test(sub)sets.

Figure 4: Relative class-wise sample robustness distribution for x and t.

Figure 5: Most-(upper row) and least robust samples (lower row) in the CIFAR-10 training set.

Figure 6: Epoch-wise learning behaviour of a small CNN trained on the CIFAR-10 training set x approximately expressing the loss in Table 2.

Figure 7: Epoch-wise learning behaviour of a small CNN trained on the CIFAR-10 training set x approximately expressing the accuracy in Table 2.

Figure 9: Relative robustness distributions for x and t. The green dashed line marks the 65% threshold. For x, 1272 samples are below this robustness level (≈ 4%); for t, 267 (≈ 3.5%).

Figure 12: Most-(upper row) and least robust (lower row) samples of each label in the MNIST training set ("0" -"9" from left to right).

Sample distribution per label of all data(sub)sets.

Small CNN performance matrix and number of epochs until overfitting commences.

Small CNN performance matrix with re-weighted LOSS (same epochs as in Table 2).

ResNet-56 performance matrix where loss was lowest on t for the amount of epochs noted.

1NN accuracy matrix.

"uni" (left) / "dist" (right) weights where ACC was highest for each test(sub)set.

MAE (left) / MSE (right) as loss-criteria where ACC was highest on t.

MAE (left)  / MSE (right) as loss-criteria where ACC on each test(sub)set is highest.

Sample distribution per label of all data(sub)sets.

A.4 LEMMA:

With the notions of 3.5 it holds:

Proof. Using the reverse triangle inequality twice. To see the first inequality, let

and note that V = |V| and W = |W|. Then, by using the uniform norm on X_z defined as

we see that the first inequality is indeed a consequence of the reverse triangle inequality:

Taking the maximum over z ∈ x ∪ t yields the statement.

B ALGORITHM (SAMPLE ROBUSTNESS)

Let x ⊂ FS be a dataset in some feature space with metric d and a label map y taking values in some target space TS with norm ‖·‖. For a sample x ∈ FS with label y(x) such that ‖y(x)‖ ≤ max over z ∈ x of ‖y(z)‖, one can calculate its robustness using the algorithms below. Note that the second algorithm is a special case of the first.
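A minimal sketch of such a computation is given below. The exact definition lives in Section 3 of the paper; here we assume, for illustration only, that the robustness of x is the smallest value of d(x, z) / |y(x) − y(z)| over all differently labelled samples z, a pointwise inverse-Lipschitz quantity. In the classification special case every label difference counts the same, so this reduces (up to a constant) to the distance to the nearest differently labelled sample. All function names and this exact scaling are our assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def sample_robustness(x, y_x, dataset, labels, d=euclidean):
    """Hypothetical robustness for scalar (regression-style) labels:
    smallest d(x, z) / |y(x) - y(z)| over differently labelled samples z."""
    return min(
        d(x, z) / abs(y_x - y_z)
        for z, y_z in zip(dataset, labels)
        if y_z != y_x
    )

def sample_robustness_classification(x, y_x, dataset, labels, d=euclidean):
    """Special case for categorical labels: every label difference is equal,
    so robustness is the distance to the nearest sample of another class."""
    return min(d(x, z) for z, y_z in zip(dataset, labels) if y_z != y_x)

# Toy 2D dataset: two classes separated along the first axis.
data = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
lbls = ["a", "a", "b"]
print(sample_robustness_classification((1.0, 0.0), "a", data, lbls))  # 2.0
```

A brute-force scan over the dataset as above is quadratic over a full dataset; the paper's Algorithm B presumably plays the same role with whatever scaling Section 3 prescribes.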

F MNIST

The MNIST data set (Lecun et al., 1998) consists of 60,000 training (x) and 10,000 test (t) 28×28-pixel greyscale images of handwritten digits (we used the same mathematical notions as for the CIFAR-10 set). Table 10 shows the label distributions of the different data(sub)sets (again, the notation is the same as for the CIFAR-10 set). The relative label distributions of both x and t are similar judging by the 50%, 20% and 80% quantiles (black percentage values are almost equal, blue and green values add up to about 100% except for class "1"). Overall, images displaying the digits "1", "4", "7" and "9" were very prominent in x L48, x L30, t L5 and t L2, whereas x M48, x M30, t M5 and t M2 mainly consist of "0"s, "2"s and "6"s.
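The least- and most robust subsets used throughout (e.g. t L5 and t M5 as the two halves of the MNIST test set) can be obtained by sorting samples by their robustness values; a minimal sketch (function and variable names are ours):

```python
def split_by_robustness(samples, robustness, k):
    """Return the k least robust and k most robust samples."""
    order = sorted(range(len(samples)), key=lambda i: robustness[i])
    least = [samples[i] for i in order[:k]]
    most = [samples[i] for i in order[-k:]]
    return least, most

# Toy example: five "images" with made-up robustness values.
imgs = ["img0", "img1", "img2", "img3", "img4"]
rob = [0.91, 0.12, 0.55, 0.78, 0.33]
least, most = split_by_robustness(imgs, rob, 2)
print(least)  # ['img1', 'img4']  (lowest robustness)
print(most)   # ['img3', 'img0']  (highest robustness)
```

With k = 5000 on the 10,000 test images this yields the t L5 / t M5 halves; k = 2000 yields t L2 / t M2.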

