HEATING UP DECISION BOUNDARIES: ISOCAPACITORY SATURATION, ADVERSARIAL SCENARIOS AND GENERALIZATION BOUNDS

Abstract

In the present work we study classifiers' decision boundaries via Brownian motion processes in the ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. We are largely motivated by the search for a soft measure that sheds further light on the decision boundary's geometry. En route, we bridge aspects of potential theory and geometric analysis (Maz'ya (2011); Grigor'yan & Saloff-Coste (2002)) with active fields of ML research such as adversarial examples and generalization bounds. First, we focus on the geometric behavior of decision boundaries in the light of adversarial attack/defense mechanisms. Experimentally, we observe a certain capacitory trend over different adversarial defense strategies: decision boundaries locally become flatter as measured by isoperimetric inequalities (Ford et al. (2019)); however, our more sensitive heat-diffusion metrics extend this analysis and further reveal that some non-trivial geometry, invisible to plain distance-based methods, is still preserved. Intuitively, we provide evidence that the decision boundaries nevertheless retain many persistent "wiggly and fuzzy" regions on a finer scale. Second, we show how Brownian hitting probabilities translate to soft generalization bounds, which are in turn connected to compression and noise stability (Arora et al. (2018)); these bounds are significantly stronger if the decision boundary has controlled geometric features.

1. INTRODUCTION AND BACKGROUND

The endeavor to understand certain geometric aspects of decision problems has led to intense research in statistical learning. These range from the study of data manifolds, through landscapes of loss functions, to the delicate analysis of a classifier's decision boundary. In the present work we focus on the latter. So far, a wealth of studies has analyzed the geometry of decision boundaries of deep neural networks (DNN), reaching profound implications in the fields of adversarial machine learning (adversarial examples), robustness, margin analysis and generalization. Inspired by recent isoperimetric results and curvature estimates (Ford et al. (2019); Moosavi-Dezfooli et al. (2019); Fawzi et al. (2016)), we attempt to provide some new aspects of decision boundary analysis by introducing and studying a corresponding diffusion-inspired approach. In this note the guiding idea is to place a heat source at the classifier's decision boundary and estimate its size/shape in terms of the amount of heat the boundary is able to emit within a given time (Fig. 1). The goal is to extract geometric information from the behavior of heat transmission. This heat-content technique is well known within capacity/potential theory and has led to a variety of results in spectral analysis relating heat diffusion and geometry, Jorgenson & Lang (2001); Grigor'yan & Saloff-Coste (2002); Maz'ya (2011). However, working with such heat diffusion directly in terms of the corresponding differential equations is impractical. To this end, we note that, due to Feynman-Kac duality, the heat estimates are convertible to Brownian motion hitting probabilities. Thus we circumvent the need for solving intractable differential equations and instead are able to employ a straightforward Monte-Carlo sampling scheme in the ambient data space (Section 3).
Background on defense training We apply the above analysis in the context of adversarial machine learning (Section 4), where one studies the interaction between an adversary and a ML system. One of the goals of the subject is to design attack/defense training strategies improving the robustness of a given ML model; in the present work we are interested in how adversarial/noise defense training is reflected geometrically. Many different metrics to estimate robustness have been proposed: on one hand, there is adversarial robustness (the probability that error samples lie very near a given data point x); on the other hand, there is corruption robustness (the probability of getting an error sample after perturbing a given data point x with some specified noise). In our context, heat diffusion naturally suggests a capacitory robustness metric: this metric is built upon the probability that a Brownian motion started at a given data point x will hit error samples within a given time window. One can perceive this metric as a combination of adversarial and noise robustness (Brownian motion has continuous paths and a stopping time determined by boundary impact). In this perspective, our work is aligned with studies of other robustness metrics and curvature results (cf. Fawzi et al. (2016) for a "semi-random" projection robustness and relations to curvature). We study the capacitory metric on the well-known CIFAR10 and MNIST datasets and observe that defense training techniques may either yield a certain (although not substantial) decrease (noise training) or fail to have a significant effect on continuous Brownian attacks overall. Surprisingly, in both cases the studied capacitory metric does not converge to the corresponding value for a flat decision boundary.
Due to our comparison statements and curvature considerations, this means that locally around clean data points the geometry is in general flattened out, but may still retain complexity and substantial areas of (small) non-vanishing curvature. In other words, from the point of view of our heat diffusion metrics, decision boundaries locally exhibit non-flat behaviour.

Background on generalization estimates Finally, we observe that the collected heat/hitting-probability metrics can further be used to obtain generalization bounds where, in a nutshell, one evaluates the performance of a model on unseen data in terms of the performance on given sampled data, the model's expressiveness, dimension, etc. In this regard, we view decision boundary heat diffusion traits as an indicator of how noise-stable a given model is; this relates Brownian hitting bounds with recent compression-based generalization techniques in the spirit of Arora et al. (2018); Suzuki et al. (2018; 2020). More precisely, we proceed in two steps: first, we construct a "smaller" compressed model that is almost equivalent to the initial one in an appropriate heat-theoretic way; second, we obtain generalization estimates for the smaller model in terms of the decision boundary hitting probabilities (computed on the empirical dataset). Furthermore, the bounds are significantly improved under additional geometric assumptions on the decision boundary of the initial model.

Additional related work

The interplay between heat diffusion and geometry lies at the heart of many topics in geometric analysis and spectral theory (cf. Jorgenson & Lang (2001); Grigor'yan (2001) for a far-reaching overview). Some direct applications of heat diffusion techniques to zero sets of eigenfunctions are seen, for example, in Steinerberger (2014); Georgiev & Mukherjee (2018a; b). The literature on adversarial ML is vast: to name a few central works in the field, we refer to Dalvi et al. (2004); Biggio & Roli (2018); Szegedy et al. (2014). Much effort has been invested in designing and understanding strategies that render a model robust to various attacks (e.g. Madry et al. (2018); Carlini & Wagner (2017)). In particular, the geometry of decision boundaries has been the focus of many works in the subject, leading to breakthroughs in curvature estimates, boundary flatness and robustness, schemes for detecting boundary complexity, proposed adversarial attacks/defenses, and diffusion-based techniques for constructing decision boundaries from partially pre-labelled data (e.g. Ford et al. (2019); Fawzi et al. (2016; 2017; 2018); Dezfooli et al. (2018); Moosavi-Dezfooli et al. (2019); Karimi et al. (2019); Karimi & Tang (2020); He et al. (2018); Szlam et al. (2008)). The theory of generalization bounds forms a classical main line of ML and statistical inference research (Vapnik (1999)). Central questions in this direction address the generalization properties of heavily over-parametrized deep neural network models: according to classical VC-dimension results, such models should overfit the data and generalize poorly. Extensive research effort has been invested in developing sharper techniques to explain the generalization of DNN models: on one hand, there are methods based on norm estimation, whose bounds do not explicitly involve the number of the network's parameters (see Golowich et al. (2019); Neyshabur et al. (2015; 2018); Wei & Ma (2019); Bartlett et al. (2017), etc.). On the other hand, recent results based on compression and VC-dimension can lead to sharper bounds (Arora et al. (2018); Suzuki et al. (2018; 2020)).

2. CONTRIBUTIONS, CONTEXT AND PAPER OUTLINE

An outline of our essential contributions is given as follows:
1. We analyze decision boundary geometries in terms of novel heat diffusion and Brownian motion techniques, with thorough theoretical estimates on curvature and flattening.
2. We show, both theoretically and empirically (in terms of adversarial scenarios on state-of-the-art DNN models), that the proposed heat diffusion metrics detect the curvature of the boundary; they complement, and in some respects are more sensitive than, previous methods of boundary analysis. Intuitively, our heat-driven metrics are sharper on a finer scale and can detect small-scale "wiggles and pockets". As an application, we are thus able to provide evidence that adversarial defenses lead to overall flatter boundaries but, surprisingly, the heat traits do not converge to the corresponding flat case; hence, finer-scale non-linear characteristics (e.g. "wiggles and pockets") are persistent.
3. Moreover, the preservation of "wiggles and pockets" means that susceptibility to naive Brownian motion attacks is not significantly decreased via adversarial defense mechanisms.
4. Finally, we introduce a novel notion of compression based on heat diffusion and prove that stability of the heat signature translates to compression properties and generalization capabilities.
In terms of context, the present note is well aligned with works such as Ford et al. (2019); Dezfooli et al. (2018); Fawzi et al. (2016; 2018). Among other aspects, these works provide substantial analysis of the interplay between geometry/curvature and adversarial robustness/defenses; in particular, we use some of these tools (e.g. isoperimetric saturation) as benchmarks and sanity checks. In contrast, our work provides a non-equivalent technique for addressing decision boundary geometry, for which we give an extensive theoretical and empirical evaluation with insights into the preservation of finer-scale traits.
Intuitively, previous distance-based geometric methods can be considered a "coarser lens", whereas the present heat-diffusion tools appear to be much more sensitive. As a large-scale example, Brownian particles emanating from a point are able to distinguish between a decision boundary which is a hyperplane at distance d and one which is a cylinder of radius d wrapping around the point. Our notion of compression is inspired by Arora et al. (2018), and establishes a connection between the Johnson-Lindenstrauss dimension reduction algorithm and diffusion techniques. Furthermore, we bridge the proposed heat-theoretic techniques with generalization bounds in the spirit of Arora et al. (2018); Suzuki et al. (2020). In particular, this shows that overall lower heat quantities at sample points imply better generalization traits. A step-wise road map of the present work is given below:
• (Subsection 3.1) We start by discussing what heat diffusion is and how it is to be evaluated; here we discuss that, via Feynman-Kac duality, one can essentially work with Brownian motion hitting probabilities.
• (Subsections 3.2 and 3.3) We introduce the isocapacitory saturation τ, a heat-theoretic metric that will be used to estimate boundary flatness. Moreover, here we emphasize properties of τ such as its relations to curvature (Proposition 3.1) and the novel information obtained from heat-theoretic methods in comparison to previous distance-based ones.
• (Subsection 3.4) We compute τ for certain geometric model cases such as hyperplanes, cones, wedges and "spiky" sets (Lemmas 3.2 and 3.3). This allows us later to evaluate how much a given geometry resembles these model cases.
• (Section 4) Next, we are in a position to evaluate and compare τ for decision boundaries of DNNs. We experimentally illustrate the effect of adversarial defense mechanisms and noise robustness on τ (PGD/FGSM on MNIST and CIFAR-10).
• (Section 5) We prove that heat transmission relates to generalization bounds (Propositions 5.1 and 5.2); in particular, lower levels of heat at sample points yield sharper generalization bounds. Finally, we complete the discussion by informally stating our compression scheme.
• (Appendix) Our methods leverage several tool sets extensively. For this reason, our goal in the main text is only to collect and showcase the techniques and results; the thorough in-depth analysis is provided in the Appendix, where the reader can find all relevant proofs as well as further background and references.
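The hyperplane-vs-cylinder example above can be probed with a quick Monte-Carlo simulation. The sketch below is illustrative only (the dimension, radii, path counts and helper names are our choices, not settings from the paper): both boundaries sit at distance d from the starting point, yet the wrap-around cylinder is hit far more often.

```python
import math
import random

def hit_prob(n, t, hit, k=300, paths=1000, seed=1):
    # fraction of k-step Brownian paths (lifespan t) that enter the target set
    rng = random.Random(seed)
    s = math.sqrt(t / k)  # step size, chosen so that t = k * s^2
    count = 0
    for _ in range(paths):
        w = [0.0] * n
        for _ in range(k):
            for j in range(n):
                w[j] += s * rng.gauss(0.0, 1.0)
            if hit(w):
                count += 1
                break
    return count / paths

n, d, t = 8, 0.5, 0.25
# decision boundary = hyperplane at distance d from the starting point
p_plane = hit_prob(n, t, lambda w: w[0] >= d)
# decision boundary = cylinder of radius d wrapping around the starting point
p_cyl = hit_prob(n, t, lambda w: w[0] ** 2 + w[1] ** 2 >= d * d)
# both boundaries are at distance d, but p_cyl comes out much larger
```

Both targets are at the same distance d from the start, so any purely distance-based metric treats them identically, while the hitting probabilities separate them clearly.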

3.1. GEOMETRY SEEN THROUGH BROWNIAN MOTION AND DIFFUSION

Notation Let us consider a dataset X := {(x_i, y_i)}_{i=1}^m consisting of feature points x_i ∈ R^n and their corresponding labels y_i ∈ {1, . . . , k}. Let us suppose that a k-label classifier f : R^n → R^k labels a point x as arg max_{i ∈ [1,k]} f(x)[i]. The decision boundary of f is given by N := {x ∈ R^n | f(x) has two or more coordinates attaining the maximal value} (cf. Fig. 2). Assuming f is sufficiently regular, one thinks of N as a collection of hypersurfaces in R^n. Further, for a given target label y we define the target (error) set E(y) as the set of points on which the classifier's decision differs from y, i.e.

E(y) := {x ∈ R^n | arg max_{i ∈ [1,k]} f(x)[i] ≠ y}

(here we remark that if arg max is set-valued at x, with several coordinates attaining the maximal value, then by convention x is contained in E(y)). Clearly, if a given data sample (x_0, y_0) ∈ X is correctly classified by f, then x_0 is outside of the error set E(y_0). Finally, we note that the boundary of E(y) coincides with E(y) ∩ N and, moreover, N is the union of the boundaries of E(y) over all labels y.

Feynman-Kac duality and hitting probabilities As mentioned in Section 1, we wish to study a heat diffusion process where we place a heat source at the decision boundary N: formally, this is given by a heat equation with appropriate initial and boundary conditions (Appendix, Subsection A.2). Avoiding the impracticality of working with the differential equations directly, we bring forward the theorem of Feynman-Kac, which relates the solution of the diffusion process to hitting probabilities of Brownian motion (Appendix, Subsection A.3). By way of notation, for an open set U ⊆ R^n, let ψ_U(x, t) denote the probability that a Brownian particle starting at the point x will enter U within time t. In other words,

ψ_U(x, t) := P_{ω∼W}[ ∃ t_0 ∈ [0, t] | ω(t_0) ∈ U ], x ∈ X, (1)

where ω denotes a Brownian motion defined over the interval [0, t] that follows the standard Euclidean Wiener distribution. The amount of heat that a point x receives from N within time t is comparable to the probability that a Brownian particle starting at x will impact the boundary within time t (cf. Fig. 2). Provided that x is correctly classified, this is equivalent to the probability of impacting the decision boundary. In general, we evaluate ψ_{E(y)}(x, t) (which we often denote by ψ(x, t) by minor abuse of notation) through direct sampling; however, in some model cases, e.g. E(y) being a half-space, a spherical shell or a conical set, ψ(x, t) has a concise closed form (Subsection 3.4 below) that can be evaluated analytically.
This allows us to easily measure deviations and compare the heat imprint of N to particular model cases.

Local analysis and set-up As mentioned above, our analysis is local. For each clean data point x we consider a ball B(x, r) centered at x with radius r and perform all our computations there. In particular, a free Brownian motion starting at x and defined over a maximal time interval [0, t] will on average travel a distance of √(nt) (Appendix, Subsection A.1). This suggests coupling r and the maximal Brownian running time t via r = √(nt) (cf. Fig. 2), so that, if not stopped by boundary impact, the Brownian motion will, on average, reach the sphere ∂B(x, r) by its maximal stopping time.
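Concretely, the direct sampling of ψ_U(x, t) can be sketched in a few lines. This is a minimal illustration (the helper names and the toy half-space target are ours, not the paper's code); the estimate is checked against the exact half-space value 2Φ(−d/√t) given by the reflection principle:

```python
import math
import random
from statistics import NormalDist

def estimate_psi(x, in_target, t, k=400, paths=1200, seed=0):
    """Monte-Carlo estimate of psi_U(x, t): the probability that a
    Brownian path started at x enters the target set within time t."""
    rng = random.Random(seed)
    n = len(x)
    s = math.sqrt(t / k)  # step size, so that the path's lifespan is t = k * s^2
    hits = 0
    for _ in range(paths):
        w = list(x)
        for _ in range(k):
            for j in range(n):
                w[j] += s * rng.gauss(0.0, 1.0)
            if in_target(w):
                hits += 1
                break
    return hits / paths

# toy check: target = half-space {z : z[0] >= d} at distance d from x
n, d, t = 3, 0.5, 0.25  # coupled ball radius would be r = sqrt(n * t)
psi = estimate_psi([0.0] * n, lambda z: z[0] >= d, t)
exact = 2 * NormalDist().cdf(-d / math.sqrt(t))  # reflection principle
```

The discretization slightly under-counts hits (a continuous path may cross the boundary between steps), so the estimate sits a little below the closed-form value.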

3.2. AN ISOPERIMETRIC AND ISOCAPACITORY PERSPECTIVE

Isoperimetric results Isoperimetric estimates will be the starting baseline (Ford et al. (2019)) used to detect low levels of curvature and boundary flatness. For some background on isoperimetric results we refer to the Appendix, Subsection A.4. Let us start by defining the relative error volume

μ(x, r) := Vol(E(y) ∩ B(x, r)) / Vol(B(x, r)).

We recall the so-called Gaussian isoperimetric inequality (Borell (1975); Ford et al. (2019)):

d ≤ −r Φ^{−1}(μ) / √n, μ ≤ 1/2, (3)

where Φ^{−1} denotes the inverse standard normal c.d.f., and d = d(x, N) denotes the median distance to N with x varying normally and concentrated in the ball B(x, r); by convention, d = 0 if μ ≥ 1/2. Here the isoperimetric result is rigid in the sense that equality in (3) occurs only if E(y) is a half-space. In Ford et al. (2019) the authors demonstrate that defense training mechanisms lead to decision boundaries that saturate this isoperimetric inequality, i.e., in this isoperimetric sense, the decision boundary N becomes locally closer to being a flat hyperplane. We define the ratio between the LHS and RHS of (3) as the isoperimetric saturation.
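The saturation of (3) is simply the ratio of the measured median distance to the Gaussian isoperimetric bound. A minimal sketch (our helper, not the paper's code), together with a half-space sanity check in the Gaussian model, where the ratio is exactly 1:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

def isoperimetric_saturation(d_median, mu, r, n):
    # Gaussian isoperimetric bound: d <= -r * Phi^{-1}(mu) / sqrt(n) for
    # mu <= 1/2; saturation = measured median distance / bound, which
    # equals 1 exactly when the error set is a half-space.
    if mu >= 0.5:
        return 0.0  # convention: d = 0 in this regime
    return d_median / (-r * Phi_inv(mu) / sqrt(n))

# half-space sanity check: if E(y) is a half-space at distance d and x varies
# as N(x, (r^2/n) I), then mu = Phi(-d * sqrt(n) / r) and the median
# distance to the boundary is exactly d
n, r, d = 10, 1.0, 0.2
mu = Phi(-d * sqrt(n) / r)
sat = isoperimetric_saturation(d, mu, r, n)  # -> 1.0
```

Values of the ratio below 1 then quantify how far the local error set is from the rigid half-space case.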

Isocapacitory results

In our context of hitting probabilities (eq. (1)), results in potential theory allow us to prove isocapacitory bounds which are similar in spirit to isoperimetric bounds. More precisely, one has:

μ(x, r) ≤ c_n ψ(x, t)^{n/(n−2)}, (4)

where c_n is an appropriate constant depending on the dimension n, and r = √(nt). The proof relies on potential-theoretic tools (capacity) and can be found in the Appendix, Proposition A.3. Motivated by the above isoperimetric saturation results, one of our main goals is to study how μ compares to ψ(x, t). To this end we define the isocapacitory saturation τ as

τ(x, r) := ψ(x, t)^{n/(n−2)} / μ(x, r).

The basic guiding heuristic is that high values of τ indicate that E(y) has a very low volume in comparison to its boundary size and respective heat emission. This is the case whenever E(y) is a very thin region with a well-spread boundary of large surface area, e.g. a set that resembles thin spikes entering the ball B(x, r). In contrast, lower values of τ indicate a saturation of the isocapacitory inequality (4) and imply that E(y) has a volume more comparable to its heat emission, e.g. thicker sets with tamer boundary. To quantify this intuition, we explicitly evaluate τ for some model scenarios (Subsection 3.4).
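The definition of τ translates directly into code. A minimal sketch (our helper name; the numbers below are illustrative, not taken from the experiments) showing how τ separates a bulky error region from a needle-like one with the same hitting probability:

```python
def isocap_saturation(psi, mu, n):
    """Isocapacitory saturation tau = psi^(n/(n-2)) / mu, ambient dim n >= 3.
    High tau: thin, well-spread ("spiky") error set; low tau: thicker set."""
    assert n >= 3 and 0.0 < mu <= 1.0 and 0.0 <= psi <= 1.0
    return psi ** (n / (n - 2)) / mu

# same hitting probability, very different relative volumes (toy numbers):
tau_thick = isocap_saturation(0.5, 0.20, 12)   # bulky error region
tau_spiky = isocap_saturation(0.5, 0.002, 12)  # thin "needle-like" region
# tau_spiky exceeds tau_thick by the ratio of the volumes
```

With ψ fixed, τ scales inversely with μ, which is exactly the "low volume, large heat emission" heuristic above.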

3.3. THE NOVEL INFORMATION GIVEN BY HEAT DIFFUSION

Distances vs. hitting probabilities As discussed above, several works investigate decision boundaries in terms of distance-based analysis (Ford et al. (2019); Fawzi et al. (2016); Karimi & Tang (2020); Karimi et al. (2019)). We remark that our analysis based on hitting probabilities augments and extends the mentioned distance-based approaches. Although related, the two concepts are not equivalent. A guiding example is given by E(y) being a dense collection of "thin needles" (Appendix, Subsections A.4, A.5): in such a scenario the average distance to N is very small, and so is the chance that a Brownian particle will hit N. On the other hand, if N is a dense collection of hyperplanes, the average distance to N is again small, but a Brownian motion will almost surely hit N. In this sense, evaluating hitting probabilities yields a different perspective than is available from distance-based analysis and sheds further light on the size and shape of the decision boundary, particularly with regard to its capacity and curvature features.

Isoperimetric vs. isocapacitory saturation Another demonstration of the additional information obtained through τ is given by almost flat shapes in higher dimensions that saturate isoperimetric bounds (Appendix, Subsection A.4). In these scenarios small geometric deformations can have a significant impact on τ while almost preserving the isoperimetric bounds. In other words, τ provides an additional level of geometric sensitivity. We discuss this further in Section 4.

The effect of curvature The interplay between the curvature of the decision boundary and robustness has been well studied recently, e.g. Fawzi et al. (2016); Moosavi-Dezfooli et al. (2019), where various forms of robustness (adversarial, semi-random and their ratio) have been estimated in terms of the decision boundary's curvature. Intuitively, the differential-geometric notion of curvature measures how a certain shape is bent.
The precise definition of curvature involves taking second-order derivatives, which is in most cases impractical. However, in our context we show that the isocapacitory saturation τ implies certain curvature bounds. These statements exploit relations between curvature and volume and lead to pointwise and integral curvature bounds. As an illustration, we have:

Proposition 3.1 (Informal). Let (x, y) ∈ X be a data sample. Then, provided that the distance d(x, N) is kept fixed, larger values of τ locally imply larger pointwise/integral curvature values.

A deeper analysis with formal statements and additional details is provided in the Appendix, Subsection A.6. The advantages that curvature yields for some types of compression schemes and generalization bounds are also investigated in depth in the Appendix, Section B.

3.4. MODEL DECISION BOUNDARIES: HYPERPLANES, WEDGES, CONES AND "SPIKY" SETS

Given a certain geometric shape, one is often faced with questions as to how flat or spherical the given geometry is. To this end, a central technique in geometric analysis is comparison to certain model cases, e.g. a sphere, plane, saddle, etc. Having introduced τ and its basic traits, we now evaluate it for several model cases (flat hyperplanes, wedges, cones, balls and "spiky" sets). Each of these model cases illustrates a distinguished τ-behaviour: from "tame" behaviour (hyperplanes, balls) to explosion (thin cylinders, "needles" and "spiky" sets). Hence, by comparing to these model cases, one can quantify how far a given decision boundary is from each of the models. We start by discussing the flat linear case:

Lemma 3.2. Let (x, y) be a data sample and suppose that E(y) forms a half-space at a distance d from the given data point x ∈ R^n. Then

τ(x, r) = (2 Φ(−d/√t))^{n/(n−2)} · Vol(B(x, r)) / V_n(d, r),

where Φ(s) is the c.d.f. of the standard normal distribution, and V_n(d, r) is the volume of the smaller n-dimensional solid spherical cap cut off at distance d from the center of a ball of radius r.

The computation uses standard reflection-principle techniques. Figure 3 depicts an experimental discussion of Lemma 3.2. Another illuminating model is given by a "spiky" set, e.g. a thin cylinder, which is in some sense the other extreme. We have:

Lemma 3.3 (Appendix, Subsection A.5). Suppose that E(y) is a cylinder of height h and radius ρ that enters the ball B(x, r). Then τ → ∞ as ρ → 0.

Figure 3: A visual depiction of decision boundaries and saturation τ for 5-layer MLP models with 20 and 100 hidden units trained over a planar "circular" dataset (depicted in grey). For each data sample x the ball B(x, r) is selected so that the relative volume μ(x, r) is 0.1. According to Lemma 3.2, a flat decision boundary would correspond to τ ≈ 3.32. (Left) The saturation τ exhibits a bi-modal behaviour with peaks around the values 3 and 4.3. These correspond to data points squeezed between thin elongated regions that locally closely resemble the flat case, or to tinier "pockets" of higher curvature, respectively. (Right) The saturation τ is more closely concentrated around 4.3 and, accordingly, the decision boundary mainly consists of smaller "pockets" of higher curvature.

Further comparison results for additional model cases are given in the Appendix, Subsection A.5.
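For the half-space model, τ can be evaluated numerically by combining the reflection-principle hitting probability with the relative cap volume, following the definition of τ from Subsection 3.2. The sketch below (our code, not the paper's implementation; the cap volume is obtained by one-dimensional quadrature over cross-sections) is an illustration under these assumptions:

```python
import math
from statistics import NormalDist

def cap_fraction(d, r, n, steps=20000):
    # relative volume mu of the spherical cap at distance d inside B(x, r):
    # mu = int_{d/r}^1 (1-u^2)^((n-1)/2) du / int_{-1}^1 (1-u^2)^((n-1)/2) du
    f = lambda u: (1.0 - u * u) ** ((n - 1) / 2.0)
    def quad(a, b):
        h = (b - a) / steps  # trapezoid rule
        return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + i * h) for i in range(1, steps)))
    return quad(d / r, 1.0) / quad(-1.0, 1.0)

def tau_halfspace(d, r, n):
    # half-space at distance d: psi = 2 * Phi(-d / sqrt(t)) by the reflection
    # principle, with t coupled to the ball radius via r = sqrt(n * t);
    # tau then follows the definition psi^(n/(n-2)) / mu
    t = r * r / n
    psi = 2.0 * NormalDist().cdf(-d / math.sqrt(t))
    return psi ** (n / (n - 2)) / cap_fraction(d, r, n)
```

Such a routine gives the flat-case reference value against which empirically measured saturations can be compared.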

4. ADVERSARIAL ATTACKS AND DEFENSES

Background and set-up We now analyze how strategies for improving adversarial and noise robustness affect the decision boundary's heat diffusion properties. In particular, we keep track of the Brownian hitting probabilities ψ and the isocapacitory saturation τ. On one hand, we can view ψ as a capacitory robustness metric against continuous interpolation attacks given by Brownian noise (see also Section 1). On the other hand, Subsection 3.4 indicates how the behaviour of τ reveals deviation from the case of a flat or "spiky" and curvy decision boundary. Our empirical analysis uses the well-known CIFAR10 and MNIST datasets (details, preprocessing and enhancements are given in the Appendix, Subsection C.5). For CIFAR10, we used the Wide-ResNet-28-10 architecture.

Analysis of results

Recent results (Ford et al. (2019); Schmidt et al. (2017)) have shown qualitative differences between the adversarially robust boundaries of MNIST and CIFAR-10, which also impact the experimental findings in this work. In short, a robust decision boundary is less spiky in the MNIST case than in the CIFAR-10 case. For more details we refer to the Appendix, Subsection C.2. In Fig. 4 we collect the statistics of the WRN and LeNet models on CIFAR10 and MNIST, respectively. On one hand, we confirm previous results (Ford et al. (2019); Fawzi et al. (2016)) implying the "flattening-of-boundary" phenomenon: noisy and adversarial training appear to improve and saturate isoperimetric bounds. Furthermore, the ball B(x, r) realizing a relative error volume μ of 1% is on average scaled up for adversarial and, especially, noisy training. On the other hand, an intriguing behaviour is observed for the decision boundary's heat diffusion traits. The isocapacitory saturation τ does not appear to concentrate around the value corresponding to a flat hyperplane: defense training strategies, both FGSM- and PGD-based, may fail to have a significant impact on the behaviour of τ in the sense of forcing it to converge to the flat-boundary value (shown as a dotted horizontal red line). Put differently, the chance that a continuous Brownian perturbation will find an adversarial example (scaled to the appropriate ball B(x, r)) is not significantly altered on average (see the Appendix, Subsection C.7 for a visual reference). However, noisy training appears to consistently deliver lower values of τ; intuitively, this is expected, as the decision boundary is adjusted by adding Gaussian "blobs", thus naturally becoming rounder. Geometrically, the sensitivity of τ to small perturbations of almost flat surfaces (Subsection 3.2) indicates that, locally around clean (unperturbed) data points, an amount of curvature and more complex geometry is still retained.
Of course, this amount is not so large as to violate the saturation of isoperimetric bounds and the robustness comparability results in the sense of Fawzi et al. (2016). For example, in the case of CIFAR10, a simple geometric model surface with a similar τ-behaviour (as observed for adversarial and noisy training) is given in the Appendix, Subsections A.4, A.5: for a data point x, an almost flat decision boundary that is concavely bent w.r.t. x with approximate curvature ≈ 1/(12.3r). These observations reveal finer properties of decision boundary flattening due to defense training: in particular, noisy training appears to flatten decision boundaries and slightly bend them concavely w.r.t. the clean data points. Further results for ResNet and CNN models are provided in the Appendix, Subsection C.7.

Spiky sets and control on τ In Fig. 4 large outlying values of τ are filtered out. However, values of τ larger than 10 can account for up to 1.3% of samples for ordinary training, and 2.1% and 2.6% for adversarial and noisy training, respectively. It follows that the geometry of high-dimensional decision boundaries does not admit too many high-curvature (see also Proposition 3.1) spiky regions of low volume and high heat emission (high surface area) in the sense of Subsections 3.2 and 3.4. However, it appears that defense training can increase the number of such spiky regions: one might explain this behaviour by seeing defense training as a bundle of additional geometric conditions that sometimes cannot all be satisfied and thus lead to a more degenerate (singular) geometry. Further, with respect to the initial analysis of Fig. 4, a natural question is whether one can control τ along with the isoperimetric saturation; ultimately, one hopes to design better decision boundaries (flatter, or appropriately curved, Moosavi-Dezfooli et al. (2019)) eventually leading to more robustness. However, getting tight control on τ could be a difficult task.
It is, indeed, possible to obtain some basic grip on τ: we trained a LeNet-5 architecture on MNIST that exhibited significantly increased τ values while preserving isoperimetric saturation (statistics are shown as the rightmost boxplot in Fig. 4). Similarly to many adversarial defenses, the training consisted of augmenting the dataset with attacks, given in this case by Brownian paths. However, it seems difficult to force τ to concentrate around the flat-case value, as well as to obtain competitive robustness for such a model. On one hand, this is explained by the need to control heat diffusion through Brownian motion: the mentioned naive method is not able to capture the hitting properties sufficiently well. On the other hand, as discussed above, heat diffusion properties can be far more sensitive than isoperimetric saturation w.r.t. minor geometric perturbations.

5. GENERALIZATION BOUNDS IN TERMS OF HITTING PROBABILITIES

Compression, noise stability and generalization Recent advances (Arora et al. (2018); Suzuki et al. (2018; 2020)) indicate that generalization can be related to compression and noise stability. The guiding strategy is: (1) a large DNN f that is stable against (layer-wise) noise injections admits an effective compression to a simpler model g which is almost equivalent to f; intuitively, the noise stability absorbs the defects introduced by compression; (2) concentration results imply generalization bounds for g. Admittedly, the generalization estimate is obtained initially for the smaller model; however, it is also possible to "transfer" the bound to f (see the discussion at the end of this Section). In this context, a simple observation is that Brownian motion and its hitting probabilities can be related, respectively, to noise injection and margins of classification: a small hitting probability of the decision boundary should indicate "margin safety" and allow compressing the parameters of the model more aggressively. However, in contrast to injecting normal noise, Brownian motion, with stopping time given by boundary impact, is more delicate and requires further analysis of the decision boundary. In the following we propose a theoretical framework that, we hope, will augment and produce further insights into the interplay between noise stability and generalization bounds. The statements are inspired by the results in Arora et al. (2018); Suzuki et al. (2020), and we follow the notation therein. First, we propose several options for goodness of approximation (compression) in the sense of heat diffusion (Appendix, Subsection B.1). We give the following definition:

Definition 1. Given a positive real number η, a classifier g is said to be an η-compression of f if

|ψ_{E_g(y)}(x, γ²) − ψ_{E_f(y)}(x, γ²)| < η (7)

for all points x in the training sample, labels y and real numbers γ.
Now, as mentioned above, we have the following generalization bound for the compressed model:

Proposition 5.1. Let us suppose that f is approximable by g in the sense of Definition 1, where g ∈ A and A is a family of classifiers $\mathbb{R}^n \to \mathbb{R}$ parametrized by q parameters assuming r discrete values. For a classifier h, let $C_h(x, y, t)$ be the event that a Brownian path starting at x hits $E_h(y)$ within time t. Then for $t_1 \le t_2 \le T$ we have
$$L_0(g) \le P_{(x,y)\sim D}\left( C_g(x, y, t_1) \right) \le P_{(x,y)\sim \mathcal{X}}\left( C_f(x, y, t_2) \right) + \eta + O\left( \sqrt{\frac{q \log r}{m}} \right) \qquad (8)$$
with probability at least $1 - e^{-q \log r}$, where $L_0$ denotes the expected loss over the true data distribution.

Taking $t_2 \to 0$ in (8), one recovers the empirical loss $\hat{L}_0(f)$ on the RHS. In other words, the generalization of the smaller model g is controlled by hitting probabilities of the initial model f and corrections related to family capacity. The next natural question is the construction of g. Inspired by Johnson-Lindenstrauss techniques (cf. also Arora et al. (2018)) we are able to recover the following statement (thorough details are given in Appendix, Subsections B.5, B.6):

Proposition 5.2 (Informal). For a fully connected feed-forward neural network f satisfying certain flatness conditions on the layer decision boundaries, there exists an η-compression g in the sense of Def. 1 whose number of parameters is logarithmically smaller than that of f.

Finally, having the generalization estimates for the smaller model g, it is natural to attempt transferring those to the initial model f; such transfer arguments are developed in Suzuki et al. (2020).

A.1 BROWNIAN MOTION

In this Subsection we introduce some basic background on Brownian motion.

Definition 2 (Brownian motion).
A real-valued stochastic process {ω(t) : t ≥ 0} is called a one-dimensional Brownian motion started at x ∈ R if the following hold:

• ω(0) = x;
• the process has independent increments, that is, for $0 \le t_1 \le \cdots \le t_m$ the increments $\omega(t_j) - \omega(t_{j-1})$, j = 2, . . . , m, are independent random variables;
• for t ≥ 0 and h > 0, the increment ω(t + h) − ω(t) is normally distributed with expectation zero and variance h;
• almost surely, the function t → ω(t) is continuous.

The process {ω(t) : t ≥ 0} is called a standard Brownian motion if x = 0. Finally, if $\omega_1, \dots, \omega_n$ are independent one-dimensional Brownian motions started at $x_1, \dots, x_n$, then the stochastic process $\omega(t) = (\omega_1(t), \dots, \omega_n(t))$ is called an n-dimensional Brownian motion started at $x = (x_1, \dots, x_n)$.

Remark A.1. The distribution of the standard 1-dimensional Brownian motion ω(t) is normal with mean 0 and variance t. It follows that the RMSD (root mean squared displacement) of the standard n-dimensional Brownian motion is $\sqrt{nt}$.

Sampling. Brownian motion simulation is prescribed directly by Definition 2. Given a step size s and a number of steps k, we sample a Brownian path as
$$\omega(k) := \sum_{i=0}^{k} s X_i, \qquad X_i \sim N(0, 1). \qquad (9)$$
By Definition 2, Var[ω(t)] = t; hence this sampling of ω corresponds to running a Brownian motion for time
$$t = k s^2. \qquad (10)$$
In particular, the root mean squared displacement of ω is $s\sqrt{nk}$. In accordance with the main text, Subsection 3.1 and Fig. 2, whenever we need to sample a Brownian motion contained within the ball B(x, r) for its lifespan [0, t], we fix the number of steps k (usually we set k = 400) and adjust the step size s accordingly, so that $r = s\sqrt{nk}$.

Estimating hitting probabilities. A straightforward empirical way to estimate the Brownian hitting probability $P_\omega[\exists t_0 \in [0, t] \,|\, \omega(t_0) \in S]$ of a target set S is to evaluate the steps ω(i), i = 0, . . . , k, and check whether $\omega(i_0) \in S$ for some $i_0$.
Of course, the precision of this computation depends on the number of sampled Brownian paths ω, as well as on the step size s and the number of steps k. Formal statements on convergence and numerical stability can be obtained, e.g., by means of concentration/Monte-Carlo results (e.g. Proposition B.12 below); in practice, in our experiments we mostly worked in the regime k ≈ 10^4, which seemed an acceptable choice in terms of numerical stability and performance. Explicit closed-form computation of hitting probabilities is a non-trivial task, though it is possible for some model cases (main text, Lemma 3.2). Dimension 1 is special: here we have the so-called "reflection principle", which says that
$$P\left( \sup_{0 \le s \le t} \omega(s) \ge d \right) = 2\, P(\omega(t) \ge d). \qquad (11)$$
For a proof of this basic statement we refer to Mörters & Peres (2010). In higher dimensions, however, there is no straightforward analog of the reflection principle, and calculating hitting probabilities of spheres leads one to the deep theory of Bessel processes. Let us consider a Brownian particle ω(t) starting at the origin in $\mathbb{R}^n$ and look at the real-valued process $|\omega(t)|$ (in the literature, these are known as Bessel processes). We are interested in the probability of the particle hitting a sphere $\{x \in \mathbb{R}^n : |x| = r\}$ of radius r within time t. Curiously, it seems that there is no known closed formula for such a hitting probability. The only formula we know of is in the form of a convergent series involving zeros of the Bessel function of the first kind, and appears in Kent (1980). For the reader interested in Kent's formula, we also refer to the associated asymptotics of zeros of the Bessel function in Watson (1944). The following heuristic is implicit in many of our calculations and motivates several of our definitions: the probability
$$P\left( \sup_{0 \le s \le t} |\omega(s)| \ge r \right) \qquad (12)$$
of a Brownian particle hitting a sphere of radius r within time t depends only on the ratio $r^2/t$.
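To make the sampling scheme concrete, the following minimal sketch (function names ours) estimates the hitting probability of a hyperplane at distance d, where the reflection principle (11) supplies the closed form 2Φ(−d/√t) to check against:

```python
import math
import numpy as np

def hitting_prob_hyperplane_mc(d, t, k=400, paths=2000, rng=None):
    """Monte-Carlo estimate of the probability that a Brownian motion run for
    time t crosses the hyperplane {x_1 = d}. Only the x_1 coordinate matters,
    since coordinate projections of Brownian motion are 1-d Brownian motions."""
    rng = np.random.default_rng(0) if rng is None else rng
    s = math.sqrt(t / k)                          # step size: t = k s^2
    incr = rng.normal(0.0, s, size=(paths, k))    # increments of x_1
    hit = incr.cumsum(axis=1).max(axis=1) >= d    # discrete running maximum
    return float(hit.mean())

# reflection principle: P(sup_{s<=t} w(s) >= d) = 2 Phi(-d/sqrt(t)) = erfc(d/sqrt(2t))
closed_form = lambda d, t: math.erfc(d / math.sqrt(2.0 * t))
```

For d = t = 1 the closed form gives ≈ 0.317; the empirical value comes out slightly lower, since the discretely sampled running maximum undershoots the true supremum of the path.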
As a consequence, given a small η > 0 and a constant c, one can choose the constant $c_n$ in $t = c_n r^2$ small enough (depending on η) such that
$$P\left( \sup_{0 \le s \le c_n r^2} |\omega(s)| \ge c\,r \right) < \eta. \qquad (13)$$
Roughly, this means the following: for a Brownian particle, the probability of hitting even a large and nearby object may be made arbitrarily small if the motion is not allowed to run sufficiently long.

A.2 HEAT DIFFUSION AND BROWNIAN MOTION DUALITY

Macroscopic vs microscopic. There are roughly two broad viewpoints on diffusion: the "macroscopic" and the "microscopic". Macroscopically, the mechanism of diffusion can be thought of as creating a flux in the direction from greater to lesser concentration. If u(x, t) measures the intensity of the quantity undergoing diffusion, and J the flux across the boundary of a region Ω, then in the simplest model one assumes that (up to a constant) J = −∇u. Further, we have the identity
$$\partial_t \int_\Omega u(x, t)\, dx = -\int_{\partial\Omega} \nu \cdot (-\nabla u)\, dS, \qquad (14)$$
where ν is the outward pointing unit normal vector to ∂Ω. By applying the divergence theorem to (14), one immediately gets the heat equation $\partial_t u = \Delta u$. Here ∆ denotes the Laplace operator given by the sum of second derivatives, $\Delta = \sum_{i=1}^n \partial^2_{ii}$. Now, many real-life diffusion processes are the result of microscopic particles jittering around in a seemingly random manner. This motivates the microscopic viewpoint, i.e., the modelling of heat diffusion via Brownian motion of particles. We posit that a particle located at $x \in \mathbb{R}^n$ at time $t_0$ will be in an open set $U \subset \mathbb{R}^n$ at time $t_0 + t$ with probability $\psi_U(x, t) = \int_U p(t, x, y)\, dy$, where p(t, x, y) is the fundamental solution of the heat equation, or more famously, the "heat kernel". In other words, p(t, x, y) solves the heat equation $(\partial_t - \Delta)\, u(x, t) = 0$, $u(x, 0) = \delta(x - y)$, with the Dirac delta distribution as the initial condition. Via the Fourier transform, it is easy to establish that p(t, x, y) is given by
$$p(t, x, y) = \frac{1}{(4\pi t)^{n/2}}\, e^{-\frac{|x-y|^2}{4t}}. \qquad (16)$$
This builds the bridge for passing between analytic statements on the side of the heat equation and probabilistic statements on the side of Brownian motion (see Grigor'Yan (2001), Taylor (2011)). The precise formulation of this duality is given by the celebrated Feynman-Kac theorem discussed in Subsection A.3 below.
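The macroscopic/microscopic match can be sanity-checked numerically. The sketch below (names ours) compares the mass ∫_U p(t, x, y) dy over an interval U in dimension one against the empirical frequency with which a random-walk particle ends up in U; note that the kernel (16) is a Gaussian of variance 2t, i.e. it corresponds to the generator Δ (rather than Δ/2), so the simulated jumps are scaled accordingly:

```python
import math
import numpy as np

def kernel_mass(a, b, x, t):
    """psi_U(x, t) = integral of p(t, x, y) over U = (a, b) in dimension 1; the
    kernel p(t,x,y) = exp(-(x-y)^2/4t)/sqrt(4 pi t) is an N(x, 2t) density."""
    sd = math.sqrt(2.0 * t)
    Phi = lambda z: 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal c.d.f.
    return Phi((b - x) / sd) - Phi((a - x) / sd)

def empirical_mass(a, b, x, t, k=200, particles=20000, rng=None):
    """Microscopic picture: k Gaussian jumps whose variances add up to 2t."""
    rng = np.random.default_rng(1) if rng is None else rng
    jumps = rng.normal(0.0, math.sqrt(2.0 * t / k), size=(particles, k))
    end = x + jumps.sum(axis=1)
    return float(np.mean((a < end) & (end < b)))
```

The two quantities agree up to Monte-Carlo error, which is the duality used throughout this section.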

Heating up the decision boundary

In our context we introduce the following heat diffusion process along the classifier's decision boundary N:
$$\begin{cases} (\partial_t - \Delta)\, \psi(x, t) = 0, \\ \psi(x, 0) = 0, & \forall x \in \mathbb{R}^n, \\ \psi(x, t)|_{x \in N} = 1, & \forall t > 0. \end{cases}$$
In other words, ψ(x, t) gives the heat quantity at the point x at time t, given that at the initial moment t = 0 all points have heat quantity 0 and afterwards a constant heat source of intensity 1 is applied only at the decision boundary N. As remarked above, this is the macroscopic picture: the mentioned Feynman-Kac duality implies that ψ(x, t) is also the hitting probability $P_\omega[\exists t_0 \in [0, t] \,|\, \omega(t_0) \in N]$.
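In experiments, ψ(x, t) is estimated exactly as in the sampling scheme above: simulate paths and record whether the predicted label changes along them. A minimal sketch for a toy planar classifier (all names ours; in a real experiment f would be the trained network's logit difference):

```python
import numpy as np

def psi_hat(x0, f, t, k=400, paths=2000, rng=None):
    """Monte-Carlo estimate of psi(x0, t): the probability that a Brownian path
    started at x0 crosses the decision boundary N = {f = 0} within time t.
    f must accept arrays of points with the coordinates along the last axis."""
    rng = np.random.default_rng(2) if rng is None else rng
    s = np.sqrt(t / k)                                   # step size: t = k s^2
    incr = rng.normal(0.0, s, size=(paths, k, len(x0)))
    traj = np.asarray(x0) + incr.cumsum(axis=1)          # positions along each path
    crossed = np.sign(f(traj)) != np.sign(f(np.asarray(x0)))
    return float(np.mean(crossed.any(axis=1)))

# toy classifier whose decision boundary is the parabola y = 1 + 0.3 x^2
f = lambda P: P[..., 1] - 0.3 * P[..., 0] ** 2 - 1.0
```

Started at the origin (at distance 1 from this boundary), the estimate lands below the flat-boundary value 2Φ(−1) ≈ 0.317, since the parabola bends away from the starting point.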

A.3 THE FEYNMAN-KAC THEOREM

It is well-known that, given a reasonable initial condition u(x, 0) = f(x), one can find an analytic solution to the heat equation via convolution with the heat kernel, $e^{t\Delta} f(x) := p(t, x, \cdot) * f(\cdot)$; this follows from (16) by convolving directly. Now, via the duality of diffusion explained above, one expects a parallel statement on the Brownian motion side, one which computes the contribution of all the heat transferred over all Brownian paths reaching a point at time t. It stands to reason that to accomplish this, one needs an integration theory defined over path spaces, which leads us to the theory of Wiener measures. We describe the main idea behind the Wiener measure briefly: consider a particle undergoing a random motion in $\mathbb{R}^n$ (given by a continuous path $\omega : [0, \infty) \to \mathbb{R}^n$) in the following manner: given $t_2 > t_1$ and $\omega(t_1) = x_1$, the probability density for the location of $\omega(t_2)$ is
$$p(t_2 - t_1, x, x_1) = \frac{1}{(4\pi (t_2 - t_1))^{n/2}}\, e^{-\frac{|x - x_1|^2}{4(t_2 - t_1)}}.$$
We posit that the motion of a random path for $t_1 \le t \le t_2$ is independent of its past history. Thus, given $0 < t_1 < \cdots < t_k$ and Borel sets $E_j \subseteq \mathbb{R}^n$, the probability that a path starting at x = 0 at t = 0 lies in $E_j$ at time $t_j$ is
$$\int_{E_1} \cdots \int_{E_k} p(t_k - t_{k-1}, x_k, x_{k-1}) \cdots p(t_1, x_1, 0)\, dx_k \cdots dx_1.$$
The aim is to construct a countably-additive measure on the space of continuous paths that captures the above property. This heuristic was first put on a rigorous footing by Norbert Wiener. Using the concept of Wiener measure, one gets the probabilistic (microscopic) description of heat diffusion, which is the content of the celebrated Feynman-Kac theorem:

Proposition A.2. Let $\Omega \subseteq \mathbb{R}^n$ be a domain, with or without boundary (it can be the full space $\mathbb{R}^n$). In case of a boundary, we work with the Laplacian with Dirichlet boundary conditions. Now, let $f \in L^2(\Omega)$.
Then for all x ∈ Ω, t > 0, we have that
$$e^{t\Delta} f(x) = E_x\left( f(\omega(t))\, \varphi_\Omega(\omega, t) \right),$$
where ω(t) denotes an element of the probability space of Brownian paths starting at x, $E_x$ is the expectation with respect to the Wiener measure on that probability space, and
$$\varphi_\Omega(\omega, t) = \begin{cases} 1, & \text{if } \omega([0, t]) \subset \Omega, \\ 0, & \text{otherwise.} \end{cases}$$
For a more detailed discussion, see Georgiev & Mukherjee (2018a).
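On the full space Ω = R^n the weight φ_Ω is identically 1, and the theorem reduces to the statement that e^{tΔ}f(x) is the expectation of f over Brownian endpoints. A quick numerical sanity check in dimension one (names ours), using the fact that for f(x) = cos(x) the semigroup acts as e^{tΔ}f(x) = e^{−t}cos(x), while ω(t) is distributed as N(x, 2t) under the kernel (16):

```python
import math
import numpy as np

def semigroup_mc(x, t, particles=200_000, rng=None):
    """Monte-Carlo estimate of e^{t Delta} f(x) = E_x[f(omega(t))] for f = cos,
    with omega(t) ~ N(x, 2t), the endpoint distribution of the kernel (16)."""
    rng = np.random.default_rng(3) if rng is None else rng
    endpoints = rng.normal(x, math.sqrt(2.0 * t), size=particles)
    return float(np.cos(endpoints).mean())

x, t = 0.7, 0.25
analytic = math.exp(-t) * math.cos(x)   # e^{t Delta} cos = e^{-t} cos
```

The Monte-Carlo value agrees with the analytic semigroup action up to sampling error.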

A.4 ISOPERIMETRIC AND ISOCAPACITORY RESULTS

Isoperimetric bounds. Isoperimetric inequalities relating the volume of a set to the surface area of its boundary have given rise to a wealth of results (Burago & Zalgaller (1988)). Given a set M with boundary ∂M, the basic pattern of isoperimetric inequalities is
$$\mathrm{Vol}(M) \le c_1\, \mathrm{Area}(\partial M)^{\frac{n}{n-1}}, \qquad (20)$$
where $c_1$ is an appropriate positive constant depending on the dimension n. In many cases, equality (or saturation in the sense of almost equality) in (20) is characterized by rather special geometry. For example, classical isoperimetric results answer the question of which planar set with a given circumference possesses the largest area, the answer being the disk. As discussed in the main text, isoperimetric considerations have recently led to significant insights into decision boundaries of classifiers subject to adversarial defense training mechanisms (Ford et al. (2019)) by revealing flattening phenomena and relations to robustness.

Isocapacitory bounds

As mentioned in the main text, one can prove isocapacitory bounds that resemble the isoperimetric ones: roughly speaking, these replace the area term with suitable Brownian hitting probabilities. We have the following result (cf. also Georgiev & Mukherjee (2018a)):

Proposition A.3. Let $B(x, r) \subset \mathbb{R}^n$, n ≥ 3, let $E \subset B(x, r)$ denote an "obstacle", and consider a Brownian particle started from x. Then the relative volume of the obstacle is controlled by the hitting probability of the obstacle:
$$\frac{\mathrm{Vol}(E)}{\mathrm{Vol}(B(x, r))} \le c_n\, \left( \psi_E(x, t) \right)^{\frac{n}{n-2}}. \qquad (21)$$
Here, $c_n$ is a positive constant whose value depends only on n, provided the ratio between $r^2$ and t is suitably bounded. In particular, in the regime $r^2 = nt$, we have
$$c_n = \left( \frac{\Gamma\left(\frac{n}{2} - 1\right)}{\Gamma\left(\frac{n}{2} - 1, \frac{n}{4}\right)} \right)^{\frac{n}{n-2}},$$
where Γ(s, x) denotes the upper incomplete Gamma function $\Gamma(s, x) := \int_x^\infty e^{-t} t^{s-1}\, dt$.

Proof. Recall that the capacity (or, more formally, the 2-capacity) of a set $K \subset \mathbb{R}^n$ is defined as
$$\mathrm{Cap}(K) = \inf_{\eta|_K \equiv 1,\ \eta \in C^\infty_c(\mathbb{R}^n)} \int_{\mathbb{R}^n} |\nabla \eta|^2. \qquad (22)$$
From Section 2.2.3 of Maz'ya (2011), we have the following "isocapacitory inequality":
$$\mathrm{Cap}(E) \ge \omega_n^{2/n}\, n^{\frac{n-2}{n}}\, (n-2)\, |E|^{\frac{n-2}{n}},$$
where $\omega_n = \frac{2\pi^{n/2}}{\Gamma(n/2)}$ is the (n−1)-dimensional surface area of $S^{n-1}$. Now, we bring in the following estimate, given by Theorem 3.7 of Grigor'Yan & Saloff-Coste (2002):
$$\psi_E(x, t) \ge \mathrm{Cap}(E) \int_0^t \inf_{y \in \partial E} p(s, x, y)\, ds.$$

Figure 5: Examples illustrating the interplay between isoperimetric and isocapacitory saturation in high dimensions. (Left) Slightly bending a flat decision boundary $N_0$ causes significant changes in τ with the isoperimetric inequality still being very close to optimal: $N_+$ (resp. $N_-$) leads to an increase (resp. decrease) in τ (cf. also Fig. 6). (Right) Small "pockets" near the data sample x can also cause large Brownian hitting probabilities (hence, large τ values) with still well-saturated isoperimetric bounds.
Now, we have
$$\psi_E(x, t) \ge \omega_n^{2/n} n^{\frac{n-2}{n}} (n-2) |E|^{\frac{n-2}{n}} \int_0^t \frac{1}{(4\pi s)^{n/2}} \inf_{y \in \partial E} e^{-\frac{|x-y|^2}{4s}}\, ds \ge \omega_n^{2/n} n^{\frac{n-2}{n}} (n-2) |E|^{\frac{n-2}{n}} \int_0^t \frac{1}{(4\pi s)^{n/2}} e^{-\frac{r^2}{4s}}\, ds = \omega_n^{2/n} n^{\frac{n-2}{n}} (n-2) |E|^{\frac{n-2}{n}} \frac{1}{4 r^{n-2} \pi^{n/2}} \int_{\frac{r^2}{4t}}^{\infty} e^{-z} z^{n/2 - 2}\, dz,$$
where the last equality follows from the substitution $z = \frac{r^2}{4s}$. After rearrangement, the proposed claim follows.

Intuitively, it makes sense that if the volume of a set is fixed, one can increase its hitting probability by "hammering" the set into a large thin sheet. However, it seems unlikely that after lumping the set together (as in a ball), one can reduce the capacity/hitting probability any further. Indeed, isocapacitory bounds are saturated by the n-ball. It is also illustrative to compare the seemingly allied concepts of capacity and surface area. A main difference between capacity and surface area lies in the way capacity interacts with hitting probabilities. As an illustrative example, think of a book which is open at an angle of 180°, 90°, 45°, respectively. Clearly, all three configurations have the same surface area, but the probability of a Brownian particle striking them goes from highest to lowest, respectively. It is rather difficult to make this heuristic precise in terms of capacity (at least directly from the definition). Capacity can be thought of as a soft measure of how "spread out" or "opened-up" a surface is, and is highly dependent on how the surface is embedded in the ambient space.

Isocapacitory vs isoperimetric saturation. A main line of analysis in the present work addresses the interplay between isocapacitory and isoperimetric saturation. In our particular context of defense training mechanisms, we observe saturation of isoperimetric bounds for the classifier's decision boundaries, which implies that the decision boundaries are not far from being flat. However, as mentioned before, it turns out that isocapacitory saturation does not concentrate around the values corresponding to hyperplanes (overall, it seems to stay well below that value).
In this sense, isocapacitory saturation acts as a finer, more sensitive measure of deviation from flatness. A simple model geometric scenario exhibiting similar behaviour is illustrated in Fig. 5 and Fig. 6.

Figure 6: A continuation of Fig. 5: isocapacitory and isoperimetric saturation while slightly bending the decision boundary ($N_-$ and $N_+$ in Fig. 5). In this plot the decision boundary $N_-$, $N_+$ is a cap of a larger sphere with radius R (set initially to 15r) in dimension 3072 (corresponding to CIFAR10). We interpolate between $N_-$ and $N_+$: first, by increasing the radius R, $N_-$ converges to the flat $N_0$ and, similarly, starting from $N_0$ we decrease R to get to $N_+$. Along this interpolation process, we plot the graphs of the isocapacitory and isoperimetric saturation. In particular, we observe at least 96% saturation of the isoperimetric bound, whereas the isocapacitory bound shows a much more sensitive behaviour on this scale.
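The flat baseline behind these comparisons can be computed in closed form. A sketch (names ours), assuming the main text's definition $\tau = \psi_E(x,t)^{n/(n-2)} \cdot \mathrm{Vol}(B(x,r))/\mathrm{Vol}(E)$ together with Lemma 3.2, i.e. ψ = 2Φ(−d/√t) with E the spherical cap at distance d:

```python
import math
import numpy as np

def cap_fraction(d, r, n, grid=20_000):
    """Relative volume of the spherical cap {x_1 >= d} of B(0, r) in R^n: the
    cross-section at height x is an (n-1)-ball of radius sqrt(r^2 - x^2)."""
    x = np.linspace(-r, r, grid)
    profile = np.clip(r ** 2 - x ** 2, 0.0, None) ** ((n - 1) / 2.0)
    return profile[x >= d].sum() / profile.sum()   # uniform grid: dx cancels

def tau_flat(d, r, t, n):
    """Isocapacitory saturation of a flat boundary at distance d from x,
    assuming tau = psi^{n/(n-2)} * Vol(B) / Vol(E)."""
    psi = math.erfc(d / math.sqrt(2.0 * t))        # = 2 * Phi(-d / sqrt(t))
    return psi ** (n / (n - 2.0)) / cap_fraction(d, r, n)
```

In the CIFAR10-scale regime n = 3072, t = r²/n, the value stays close to 2 for small d, consistent with the τ ≈ 2 plateau of the hyperplane model discussed in Subsection A.5.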

A.5 MODEL CASES

We first begin with the proof of Lemma 3.2.

Proof. Let us select an orthonormal basis $\{e_1, \dots, e_n\}$ so that $e_1$ coincides with the given hyperplane's normal vector. A standard fact about n-dimensional Brownian motion is that its projections onto the coordinate axes are again one-dimensional Brownian motions (Mörters & Peres (2010)). Thus, projecting the n-dimensional Brownian motion onto $e_1$, the hitting probability of the hyperplane is the same as the probability that a one-dimensional Brownian motion ω(t) passes a certain threshold d by time t. To compute this probability we use the reflection principle (11) in conjunction with Remark A.1; consequently, the hitting probability equals $2\Phi(-d/\sqrt{t})$. The computation of µ(x, r) follows by definition. We note that the dimension n enters only through the spherical cap volume. An impression of how τ behaves for different choices of n in terms of the distance d is given in Fig. 7. In particular, one observes the well-known concentration of measure phenomenon and Lévy's lemma: the volume of the spherical cap exhibits a very rapid decay as n becomes large. Moreover, experiments reveal a curious phenomenon: there is a threshold distance $d_0$ up to which τ ≈ 2, after which τ explodes. In Fig. 8 we plot further interesting model cases where the error set forms a wedge (the region between two intersecting hyperplanes) or a cone.

Spiky sets

As discussed in the main text, one observes a high isocapacitory saturation τ for the so-called "spiky" sets: sets of relatively small volume and relatively large/dense boundary. Theoretically, a guiding model case in this direction is given by Lemma 3.3 in the main text, whose proof we now record.

Proof. Let $T_\rho$ denote the ρ-tubular neighborhood of a line segment of length h inside $\mathbb{R}^n$. Clearly, $T_\rho \cong B(0, \rho) \times [0, h]$, where B(0, ρ) is a ρ-ball inside $\mathbb{R}^{n-1}$. By the well-known process of Steiner symmetrization in $\mathbb{R}^n$, it is clear that the expression for capacity in (22) will be minimized by a function that is "radially symmetric" around the central axis of the tube $T_\rho$, that is, f(x, y) = f(|x|), where x ∈ B(0, ρ), y ∈ [0, h]. Then, as we scale ρ → λρ with λ → 0, we have $\mathrm{Cap}(T_{\lambda\rho}) \sim \lambda^{n-3}\, \mathrm{Cap}(T_\rho)$ (which is seen directly from the definition (22)), whereas the volume scales as $|T_{\lambda\rho}| = \lambda^{n-1} |T_\rho|$. Now assume that the cylinder $T_\rho$ is inside the closed ball $B(x, r) \subset \mathbb{R}^n$, the central axis of $T_\rho$ is pointing towards x, and $T_\rho$ is touching the boundary of B(x, r). To pass from capacity to the hitting probability of the set $T_\rho$, we use the following bound from Grigor'Yan & Saloff-Coste (2002):
$$\frac{\mathrm{Cap}(T_\rho)\, r^2}{\mathrm{Vol}(B(x, r))}\, e^{-C \frac{r^2}{t}} \le \psi_{T_\rho}(x, t).$$
Finally, using the definition of τ and putting the above estimates together, one sees that in the time regime of $O(r^2)$, τ scales like $\lambda^{-2/(n-2)}$, and hence τ → ∞ as λ → 0. See also Figure 8 for a visual discussion of the isocapacitory saturation for the model cases of wedges and cones.
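The exponent arithmetic in this scaling argument can be double-checked mechanically. Assuming, as above, that ψ scales like the capacity and $\tau = \psi^{n/(n-2)} \cdot \mathrm{Vol}(B)/\mathrm{Vol}(E)$, the λ-exponent of τ is $n(n-3)/(n-2) - (n-1)$, which indeed collapses to $-2/(n-2)$:

```python
from fractions import Fraction

# Cap(T_{lam*rho}) ~ lam^{n-3} and |T_{lam*rho}| = lam^{n-1} |T_rho|, so the
# exponent of lam in tau ~ psi^{n/(n-2)} / (relative volume) is
# n(n-3)/(n-2) - (n-1) = -2/(n-2) < 0, hence tau -> infinity as lam -> 0.
for n in (4, 10, 100, 3072):
    exponent = Fraction(n * (n - 3), n - 2) - (n - 1)
    assert exponent == Fraction(-2, n - 2)
```

The negative exponent confirms that arbitrarily thin spikes force the saturation τ to blow up, for every dimension n ≥ 4.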

A.6 CURVATURE ESTIMATES IN TERMS OF ISOCAPACITORY SATURATION

The geometric concept of curvature has a rich history and plays a central role in differential geometry and geometric analysis. There are several notions of curvature in the literature, ranging from intrinsic notions like sectional, Ricci or scalar curvature to extrinsic (that is, dependent on the embedding) notions like principal curvatures and mean curvature, which are encoded in the second fundamental form. In this note we use a somewhat "soft" definition of curvature, following previous work (Fawzi et al. (2016); Dezfooli et al. (2018)). Suppose the decision boundary $N_f$ is sufficiently regular ($C^2$ is enough for our purpose) and it separates $\mathbb{R}^n$ into two components $R_1 := \{f > 0\}$ and $R_2 := \{f < 0\}$, corresponding to a binary classification (the construction in the multi-label case is analogous). For a given $p \in N_f$, let $r_j(p)$ denote the radius of the largest sphere that is tangent to $N_f$ at p and fully contained in $R_j$. Then one defines the curvature κ at p as
$$\kappa(p) = \frac{1}{\min\left( r_1(p), r_2(p) \right)}.$$
See Fig. 10 for a geometric illustration (Figure 10: "soft" definition of curvature given by the inverse radius of the osculating sphere). However, it turns out that most notions of curvature are quite subtle (see Fawzi et al. (2016)) and, at this point, seemingly cumbersome and intractable to handle experimentally. We will therefore take an indirect approach and attempt to read off the mutual effects of the curvature and the isocapacitory saturation τ. Again, we begin with the model cases: we first study the behaviour of the curvature κ when τ achieves its least possible value. We start by fixing some notation. As before, let us consider a ball B(x, r) with an error set $E \subset B(x, r)$ and boundary N = ∂E (clearly, our main case of interest is E = E(y) ∩ B(x, r)). Let us denote the distance d = d(x, N) and suppose the point y ∈ N realizes this distance, i.e. d(x, y) = d.
To rule out some degenerate cases and ease the analysis we introduce the following assumption:

Assumption: The hypersurface N and the point x are on different sides of the tangent hyperplane $H^* := T_y N$ (cf. Fig. 11). This assumption is also technically important, as otherwise low values of τ can be produced by annuli surrounding x.

With that in place, we have the following rigidity result:

Proposition A.4. Let us fix the distance d = d(x, N) and suppose the assumption above holds. Then the least possible value of τ is attained only if the curvature κ of the hypersurface N is 0.

Proof. As above, let $H^*$ be the tangent hyperplane at distance d from x, and let C denote the (smaller) spherical cap formed by $H^* \cap B(x, r)$. The proof relies on the following variational argument. If N is not the same as $H^*$, then N ⊆ C, with $y \in N \cap H^*$. We wish to argue that one can then perturb N infinitesimally so as to decrease the value of τ, so that the only minimizer has to be $H^*$. The basic idea is to cut out a small piece $p_v$ around a point v and paste it in the region around another point ṽ (Fig. 11). We say that N has positive curvature at some point z if the ball defining the curvature at z and the point x lie on different sides of N. The construction is as follows. Let S(x, s) be the (n−1)-sphere centered at x with radius s. We consider two cases:

Case I: Suppose that there exist $s_1 < s_2 \le r$ and points v, ṽ ∈ N such that the curvature of N at $v \in N \cap S(x, s_1)$ is greater than the curvature at $\tilde{v} \in N \cap S(x, s_2)$; let us, moreover, choose the infimum among such $s_1$ and the supremum among such $s_2$. To define the mentioned piece $p_v$, we consider two small balls B(v, ε), B(ṽ, ε) (where $\varepsilon \ll s_2 - s_1$), and cut out a set $p_v = E \cap B(v, \varepsilon)$ such that $\partial(E \setminus B(v, \varepsilon))$ is congruent to $N \cap B(\tilde{v}, \varepsilon)$ (this is possible due to the curvature assumptions at v, ṽ).
Then we define the new error set $E' = (E \cup p_{\tilde{v}}) \setminus p_v$ and the boundary $N' = \partial E'$, where $p_{\tilde{v}}$ denotes the image of $p_v$ under the rigid motion attaching it inside B(ṽ, ε) (see Fig. 11). It is now clear that $|E| = |E'|$, but $\psi_{E'}(x, T) < \psi_E(x, T)$ for all T > 0. The last inequality follows from the evaluation of the explicit heat kernel that defines the hitting probability ψ, as stated by Feynman-Kac duality:
$$\psi_E(x, T) = \int_0^T \int_E \frac{1}{(4\pi t)^{n/2}} e^{-\frac{|x-y|^2}{4t}}\, dy\, dt > \int_0^T \int_{E'} \frac{1}{(4\pi t)^{n/2}} e^{-\frac{|x-y|^2}{4t}}\, dy\, dt = \psi_{E'}(x, T).$$
It follows from the definition of τ that $\tau_E \ge \tau_{E'}$.

Case II: If Case I is not satisfied, then, similarly, we choose two points v, ṽ, but instead of defining the piece $p_v$ by intersection with a small ball around v, we select $p_v$ as a "concavo-convex lens shaped" domain, where the curvature on the concave "inner side" of the lens is greater than that on the convex "outer side"; the same exchange argument then applies.

With reference to our previous discussion of spikes, it heuristically makes sense that a spike must have reasonably high curvature (it can have high curvature on average, or, if it is flat at most places, a sharp needle-like end where the curvature is very high). In the same setting as Proposition A.4, let us, moreover, for simplicity assume that N is the graph of a function over the tangent hyperplane $H^*$ (Fig. 11).

Proposition A.5. In the above setting, let us fix the value of d. Then, if the maximum curvature $\kappa_{\max}$ of N is sufficiently high (greater than some universal constant), it satisfies
$$\kappa_{\max} \ge \frac{\tau^{1/n}}{r}\, \Phi\left( -\frac{d}{\sqrt{t}} \right)^{-\frac{1}{n-2}}, \qquad (27)$$
where Φ denotes the c.d.f. of the standard normal distribution. If a point attaining this maximum curvature is within the half concentric ball B(x, r/2), then $\kappa_{\max}$ satisfies the stronger estimate
$$\kappa_{\max} \ge \left( \frac{\tau\,(r - d)}{r^n} \right)^{\frac{1}{n-1}} \Phi\left( -\frac{d}{\sqrt{t}} \right)^{-\frac{n}{(n-1)(n-2)}}. \qquad (28)$$

Proof. Recalling the definition of the isocapacitory saturation τ, we will bound the numerator (resp. denominator) of τ from above (resp. below).
First, for the numerator $\psi_E(x, t)$ we use a basic monotonicity property of hitting probabilities: for two sets A ⊆ B one has $\psi_A(x, t) \le \psi_B(x, t)$, which follows directly from the definition of ψ. Now, since E ⊆ C, where C is the smaller spherical cap of B(x, r) cut off by $H^*$, we have $\psi_E(x, t) \le \psi_C(x, t)$. Recalling the explicit form of $\psi_C$ from Lemma 3.2 of the main text, we obtain
$$\psi_E(x, t) \le \Phi\left( -\frac{d}{\sqrt{t}} \right).$$
Second, to bound the denominator of τ (i.e. Vol(E)), we observe that if $\kappa_{\max}$ is large enough, then by definition E contains a ball of radius $\frac{1}{\kappa_{\max}}$, whence $\mathrm{Vol}(E) \ge \frac{\omega_n}{\kappa_{\max}^n}$, where $\omega_n$ denotes the volume of the unit n-dimensional ball. This finally implies
$$\tau \le \Phi\left( -\frac{d}{\sqrt{t}} \right)^{\frac{n}{n-2}} \frac{\mathrm{Vol}(B(x, r))}{\mathrm{Vol}(E)} \le \Phi\left( -\frac{d}{\sqrt{t}} \right)^{\frac{n}{n-2}} r^n \kappa_{\max}^n,$$
which proves (27). If a point of maximum curvature is inside the concentric ball of radius r/2, then E contains $\approx \kappa_{\max}(r - d)/2$ balls of radius $\frac{1}{\kappa_{\max}}$, which implies that $\mathrm{Vol}(E) \gtrsim \kappa_{\max}(r - d)\, \frac{\omega_n}{\kappa_{\max}^n}$. The rest of the proof is similar.

Now, we give a curvature estimate which works in any regime, without any restrictions. The trade-off is a global average bound of $L^p$-type rather than a pointwise estimate.

Proposition A.6. In the setting as above, let us fix the distance d = d(x, N). At each point of N, let us denote by κ the maximal sectional curvature of N at that point. The following estimate holds:
$$\| K \|_{L^1} \ge V_n(d, r) - \frac{\omega_n r^n \left( 2\Phi\left( -\frac{d}{\sqrt{t}} \right) \right)^{\frac{n}{n-2}}}{\tau}, \qquad (29)$$
where $V_n(d, r)$ denotes the volume of the smaller spherical cap at distance d, the constant $\omega_n$ denotes the volume of the unit ball in $\mathbb{R}^n$, and the function K is an integral function of the curvature κ over lines (defined in (31) below).

Proof. Again, we suitably bound the numerator and denominator of τ. Starting with the numerator, as explained in Proposition A.5, we have by monotonicity
$$\psi_E(x, t) \le 2\Phi\left( -\frac{d}{\sqrt{t}} \right). \qquad (30)$$
To bound the denominator of τ we proceed as follows.
Let N be the graph of the function $g(x_1, \dots, x_{n-1})$, where the variables $x_j$ are taken from the hyperplane $H^*$ (Fig. 11) at distance d from x; the point at which N touches this hyperplane is taken as the origin. Let S be the set of all $(x_1, \dots, x_{n-1})$ such that $g(x_1, \dots, x_{n-1}) \in B(x, r)$, and let ϕ be a smooth cut-off function defined on the hyperplane such that ϕ ≡ 1 on S and ϕ ≡ 0 outside the ε-tubular neighborhood of S. Finally, let $\tilde{g} := \varphi g$. Now we see that, letting $a = (r^2 - d^2)^{1/2}$,
$$V_n(d, r) - \mathrm{Vol}(E) \le \int_{\rho=0}^{a} \int_{S^{n-2}} \tilde{g}(\rho, \theta)\, \rho^{n-2}\, d\rho\, d\theta.$$
Now, if η denotes the unit vector in the direction of a fixed (ρ, θ), observing that $\tilde{g}(0) = 0$, we have by the fundamental theorem of calculus
$$\tilde{g}(\rho, \theta) = \int_0^1 \partial_t\, \tilde{g}(t\rho\eta, \theta)\, dt.$$
In turn, applying the fundamental theorem a second time and observing that $\nabla \tilde{g}(0) = 0$, we have
$$\tilde{g}(\rho, \theta) = \int_0^1 \int_0^1 \partial_s \partial_t\, \tilde{g}(st\rho\eta, \theta)\, ds\, dt.$$
Putting everything together we get
$$V_n(d, r) - \mathrm{Vol}(E) \le \int_{\rho=0}^{a} \int_{S^{n-2}} \left| \int_0^1 \int_0^1 \partial_s \partial_t\, \tilde{g}(st\rho\eta, \theta)\, ds\, dt \right| \rho^{n-2}\, d\rho\, d\theta.$$
Now, we define the following integral quantity:
$$K(\rho, \theta) = \int_0^1 \int_0^1 \left| \kappa(st\rho\eta, \theta) \right|\, ds\, dt. \qquad (31)$$
Noting that the maximum sectional curvature bounds the second derivatives, we finally have
$$V_n(d, r) - \mathrm{Vol}(E) \le \| K \|_{L^1}.$$
To obtain (29) we now put all the above estimates together and let ε → 0.

B APPENDIX B: GENERALIZATION BOUNDS AND COMPRESSION SCHEMES

Background. A main line of ML and statistical inference research addresses questions of generalization. To set the stage we start with some notation. Let us suppose that the dataset $\mathcal{X}$ is sampled from a probability distribution D, i.e. (x, y) ∼ D. Following conventions from the literature (Arora et al. (2018)), we define the expected margin loss of a classifier f by
$$L_\gamma(f) := P_{(x,y)\sim D}\left( f(x)[y] \le \gamma + \max_{j=1,\dots,k;\ j \ne y} f(x)[j] \right).$$
We use the notation $\hat{L}_\gamma$ to denote the empirical margin loss over the given dataset $\mathcal{X}$. Finally, the generalization error is defined as $L_\gamma - \hat{L}_\gamma$. Roughly speaking, standard generalization results attempt to estimate the performance of the classifier on unseen samples (i.e. the full data distribution), thus yielding bounds of the form
$$L_{\gamma_1}(f) \le \hat{L}_{\gamma_2}(f) + F(\gamma_1, \gamma_2, f, \mathcal{X}),$$
where F is an additional term that usually depends, e.g., on the size of $\mathcal{X}$, the expressiveness of f, and further margin information ($\gamma_1, \gamma_2$).

B.1 COMPRESSION IN A HEAT DIFFUSION SENSE IMPLIES GENERALIZATION BOUNDS

We first state a well-known concentration inequality due to Hoeffding, which will find repeated use in the ensuing sections:

Proposition B.1 (Hoeffding's inequality). Let $X_1, \dots, X_n$ be independent random variables taking values in the interval [0, 1], and let $\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n)$ be the empirical mean of these random variables. Then we have:
$$P\left( \bar{X} - E\bar{X} \ge t \right) \le e^{-2nt^2}. \qquad (35)$$

We now provide the proof of Proposition 5.1 of the main text.

Proof. The strategy of proof follows well-known "weak-law-of-large-numbers" concentration techniques, in a spirit similar to Arora et al. (2018).

Step 1. First, we show that for a given g, as $|\mathcal{X}| \to \infty$,
$$P_{(x,y)\sim \mathcal{X}}\left( C_g(x, y, t_1) \right) \to P_{(x,y)\sim D}\left( C_g(x, y, t_1) \right),$$
where $C_g(x, y, \gamma^2)$ is the event that a Brownian path starting at x hits $E_g(y)$ within time $\gamma^2$. The rate of convergence is determined through Chernoff concentration bounds. Choose α ∈ A, and let $g_\alpha$ be the corresponding classifier. Attached to each sample point $x_j$ there is a Bernoulli random variable $X_j$ which takes the value 1 if $C_{g_\alpha}(x_j, y, \gamma^2)$ happens, and 0 otherwise. Then the average $\bar{X} = \frac{1}{m} \sum_{j=1}^m X_j$ is the average of m i.i.d. Bernoulli random variables, each of whose expectations is given by $P_{(x,y)\sim D}\, C_{g_\alpha}(x, y, \gamma^2)$. Furthermore, we note that if a data sample is misclassified, then the Brownian particle almost surely hits the error set. Combining this observation with the concentration estimate (35) above, we obtain
$$L_0(g_\alpha) \le P_{(x,y)\sim D}\, C_{g_\alpha}(x, y, \gamma^2) \le P_{(x,y)\sim \mathcal{X}}\, C_{g_\alpha}(x, y, \gamma^2) + \xi, \qquad (37)$$
with probability at least $1 - e^{-2\xi^2 m}$. If each classifier $g_\alpha$ has q parameters, each of which can take r discrete values, we take $\xi = \sqrt{\frac{q \log r}{m}}$.

Step 2. The estimate from the previous step should hold for every classifier $g_\alpha$ in the family A with large probability. This is guaranteed by a union bound and a tuning of the Chernoff bounds from the convergence rate.
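Hoeffding's bound is easy to visualize empirically; the following minimal sketch (constants ours) compares the observed two-sided deviation frequency of Bernoulli sample means against twice the one-sided bound above:

```python
import numpy as np

# Empirical check of Hoeffding's inequality for Bernoulli(p) variables: the
# fraction of m-sample means deviating from p by at least t is bounded by
# 2 e^{-2 m t^2} (a union of the two one-sided bounds).
rng = np.random.default_rng(4)
p, m, t, trials = 0.3, 500, 0.05, 20_000
means = rng.binomial(m, p, size=trials) / m
observed = np.mean(np.abs(means - p) >= t)   # empirical two-sided tail
bound = 2.0 * np.exp(-2.0 * m * t ** 2)      # ≈ 0.164
```

The observed tail (on the order of 10^-2 here) sits far below the bound, as expected: Hoeffding's inequality holds for arbitrary bounded variables and is loose for Bernoulli means.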
More precisely, there are $r^q$ different choices of α ∈ A, and hence, taking the union of the estimates (37), one can say that
$$P_{(x,y)\sim D}\, C_{g_\alpha}(x, y, \gamma^2) \le P_{(x,y)\sim \mathcal{X}}\, C_{g_\alpha}(x, y, \gamma^2) + \sqrt{\frac{q \log r}{m}}$$
with probability at least $1 - e^{-q \log r}$, simultaneously over all α ∈ A.

Step 3. Finally, one uses the fact that f is approximable by at least one $g = g_{\alpha_0}$ for some $\alpha_0 \in A$. Via Definition 1 of the main text, one sees that
$$P_{(x,y)\sim \mathcal{X}}\, C_{g_{\alpha_0}}(x, y, \gamma^2) \le P_{(x,y)\sim \mathcal{X}}\, C_f(x, y, \gamma^2) + \eta,$$
which finally gives that, with probability at least $1 - e^{-q \log r}$, we have
$$L_0(g) \le P_{(x,y)\sim \mathcal{X}}\, C_f(x, y, \gamma^2) + \eta + O\left( \sqrt{\frac{q \log r}{m}} \right). \qquad (39)$$

Remark B.2. As noted, a classifier f classifies a point x wrongly if and only if $\psi_{E(y)}(x, t) = 1$ for all time scales t. With this observation, and since (39) holds for all real numbers γ, letting γ → 0 we have that, with probability at least $1 - e^{-q \log r}$,
$$L_0(g) \le \hat{L}_0(f) + \eta + O\left( \sqrt{\frac{q \log r}{m}} \right).$$
This recovers a loss estimate similar to the one in Theorem 2.1 of Arora et al. (2018). Indeed, one can consider $P_{(x,y)\sim \mathcal{X}}\, C_f(x, y, \gamma^2)$ as a "soft" or probabilistic measure of classification with margin ≈ γ. When defining the notion of a compression, instead of taking a pointwise difference as in Definition 1 of Arora et al. (2018), we would like to capture the idea that the decision boundary of a good compression should be "close enough" to the decision boundary of the original classifier. In our context, this means that their "heat signatures" at the sample points should be close enough at all time scales. As noted in the main text, Definition 1 is one natural option for defining goodness of compression in a heat-diffusion sense. Another natural way is to consider the Brownian motion's running time and define a good approximation as follows:

Definition 3. Given a positive real number η, a classifier g is said to be an η-compression w.r.t.
hitting time of $f$ if
$$\psi_{E_g(y)}(x,\gamma^2-\eta) \le \psi_{E_f(y)}(x,\gamma^2) \le \psi_{E_g(y)}(x,\gamma^2+\eta)$$
for all points $x$ in the training sample, labels $y$ and real numbers $\gamma^2\ge\eta$.

Analogously, we have the following:

Proposition B.3. Suppose that $f$ is approximable by $g$ in the sense of Definition 3. Here $g\in A$, where $A$ is a family of classifiers $\mathbb{R}^n\to\mathbb{R}$ parametrized by $q$ parameters assuming $r$ discrete values. As before, for a classifier $h$, let $C_h(x,y,t)$ be the event that a Brownian path starting at $x$ hits $E_h(y)$ within time $t$. Then we have
$$L_0(g) \le \mathbb{P}_{(x,y)\sim\mathcal{D}}\, C_g(x,y,\gamma^2-\eta) \le \mathbb{P}_{(x,y)\sim\mathcal{X}}\, C_f(x,y,\gamma^2) + O\left(\sqrt{\frac{q\log r}{m}}\right)$$
with probability at least $1-e^{-q\log r}$.

The proof proceeds similarly as above. Letting $\gamma^2\to\eta$ gives us
$$L_0(g) \le \mathbb{P}_{(x,y)\sim\mathcal{X}}\left(C_f(x,y,\eta)\right) + O\left(\sqrt{\frac{q\log r}{m}}\right).$$
Again, the first term on the right-hand side can be interpreted as the geometric margin of classification. In particular, if the classifier $f$ separates points by a distance of $\approx\sqrt{n\eta}$, then, since the Brownian motion travels $\approx\sqrt{n\eta}$, hitting the error set will happen only if a misclassification occurred, i.e. we have $\mathbb{P}_{(x,y)\sim\mathcal{X}}\left(C_f(x,y,\eta)\right) \approx L_0(f)$.
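For illustration, the hitting probabilities appearing above can be estimated by plain Monte-Carlo path sampling, in the spirit of the Feynman–Kac reduction described in the introduction. The sketch below is ours, not the paper's code: the halfspace classifier and all parameter names are illustrative assumptions. By Hoeffding's inequality (Proposition B.1), $m$ sampled paths estimate the hitting probability to accuracy $\xi$ except with probability at most $e^{-2\xi^2 m}$.

```python
import numpy as np

def hitting_probability(classifier, x, label, t, n_paths=1000, n_steps=100, rng=None):
    """Monte-Carlo estimate of P[C_f(x, y, t)]: the chance that a Brownian
    path started at x enters the error set (label changes) within time t."""
    rng = np.random.default_rng() if rng is None else rng
    dt = t / n_steps
    hits = 0
    for _ in range(n_paths):
        pos = x.copy()
        for _ in range(n_steps):
            pos += np.sqrt(dt) * rng.standard_normal(x.shape)  # Brownian increment
            if classifier(pos) != label:   # path has entered the error set
                hits += 1
                break
    return hits / n_paths

# Toy example (our assumption): error set = halfspace {p_0 > 1} in R^8.
clf = lambda p: int(p[0] > 1.0)
est = hitting_probability(clf, np.zeros(8), label=0, t=1.0,
                          n_paths=2000, rng=np.random.default_rng(0))
```

For a flat boundary at distance $d$, the reflection principle predicts $\psi = 2\Phi(-d/\sqrt t)$ ($\approx 0.317$ for $d = 1$, $t = 1$), which the estimate approaches up to time-discretization bias.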

B.2 A SHARP VARIANT OF THE JOHNSON-LINDENSTRAUSS ALGORITHM

Several state-of-the-art compression schemes utilize a dimensionality reduction in the spirit of Johnson-Lindenstrauss (JL), Arora et al. (2018). In this subsection we discuss a JL compression scheme that will later be coupled with and tuned by heat-diffusion estimates. We begin by discussing a variant of JL (Alg. 1).

Data: Original matrix $A$ of dimension $h_1\times h_2$, $\beta\in(0,1)$.
Result: Stochastic compressed matrix $\hat A$ with $O\left(\log(h_1h_2)/\beta\alpha^2\right)$ non-zero entries such that $\mathbb{P}\left[\|\hat Ax - Ax\| \ge \alpha\|A\|_F\|x\|\right] \le \beta$.
Start with matrix $A$ and a real number $\alpha$;
for each $i \le h_1$, $j \le h_2$ do
  let $z_{ij} = 1$ with probability $p_{ij} = \frac{2a_{ij}^2}{\beta\alpha^2\|A\|_F^2}$, and $0$ otherwise;
  let $\hat a_{ij} = \frac{z_{ij}a_{ij}}{p_{ij}}$.
end
Return $\hat A = (\hat a_{ij})$.
Algorithm 1: Compressing a matrix $A\in\mathbb{R}^{h_1\times h_2}$

Proposition B.4. Let $A$ be a matrix of dimension $h_1\times h_2$. Then one can find a compressed matrix $\hat A$ such that $\|Ax - \hat Ax\| \le \alpha\|A\|_F\|x\|$ with probability at least $1-\beta$, where the number of parameters of $\hat A$ is $O\left(\log(h_1h_2)/\beta\alpha^2\right)$.

A proof of Proposition B.4 in the spirit of classical JL can be given; here, however, we introduce a Bernoulli scheme which is a minor modification of Algorithm 2 of Arora et al. (2018).

Proof. Define random variables $z_{ij}$ which take the value 1 with probability $p_{ij} = \frac{2a_{ij}^2}{\beta\alpha^2\|A\|_F^2}$ and the value 0 otherwise, and set $\hat a_{ij} = \frac{z_{ij}a_{ij}}{p_{ij}}$. One can now calculate that $\mathbb{E}(\hat a_{ij}) = a_{ij}$ and $\mathrm{Var}(\hat a_{ij}) \le \beta\alpha^2\|A\|_F^2$. Using the above, one can further calculate that $\mathbb{E}(\hat Ax) = Ax$ and $\mathrm{Var}(\hat Ax) \le \|x\|^2\|A\|_F^2\beta\alpha^2$. By Chebyshev's inequality, this gives us that
$$\mathbb{P}\left[\|\hat Ax - Ax\| \ge \alpha\|A\|_F\|x\|\right] \le \beta.$$
Now, the expected number of non-zero entries in $\hat A$ is $\sum_{i,j} p_{ij} = \frac{2}{\beta\alpha^2}$. An application of Chernoff bounds now gives that, with high probability, the number of non-zero entries is $O\left(\log(h_1h_2)/\beta\alpha^2\right)$.
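A minimal numerical sketch of Algorithm 1 follows (ours, not the authors' code; we additionally clip $p_{ij}$ at 1, which the analysis implicitly assumes):

```python
import numpy as np

def compress_matrix(A, alpha, beta, rng=None):
    """Bernoulli sparsification in the spirit of Algorithm 1: keep entry a_ij
    with probability p_ij = min(1, 2 a_ij^2 / (beta * alpha^2 * ||A||_F^2)),
    rescaled by 1/p_ij so that E[A_hat] = A entrywise."""
    rng = np.random.default_rng() if rng is None else rng
    fro2 = np.sum(A ** 2)                               # ||A||_F^2
    p = np.minimum(1.0, 2.0 * A ** 2 / (beta * alpha ** 2 * fro2))
    z = rng.random(A.shape) < p                         # Bernoulli mask
    return np.where(z, A / np.where(p > 0, p, 1.0), 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
A_hat = compress_matrix(A, alpha=0.5, beta=0.1, rng=rng)
```

The expected number of surviving (non-zero) entries is $\sum_{i,j} p_{ij} \le 2/(\beta\alpha^2)$, far below the $64\times 64$ entries of the original matrix in this toy setting.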

B.3 HITTING PROBABILITY, CAPACITY SENSITIVITY AND COMPRESSION

As discussed in the main text, here we use hitting probabilities associated to the decision boundary to define a concept of "capacity sensitivity" of a neural net layer. The heuristic is: the lower the capacity sensitivity of a layer, the easier it is to compress the layer to one with fewer parameters. This goes in the spirit of current state-of-the-art results on compression and generalization bounds (Arora et al. (2018), Suzuki et al. (2018), Suzuki et al. (2020)). In particular, in Arora et al. (2018) the authors provide the notions of noise sensitivity and noise cushions motivated by Gaussian noise injections. Our first proposed definition of "heat-diffusion noise cushions" and capacity sensitivity goes as follows:

Definition 4. Let $\eta\sim\mathcal{N}$ be distributed along a noise distribution $\mathcal{N}$ concentrated in the ball $\|\eta\|\le\eta_0$. We define the capacity sensitivity $S(x,A_i;t)$ of a layer $A_i$ at the point $x$ as
$$S(x,A_i;t) := \mathbb{E}_{\eta\sim\mathcal{N}}\left[\frac{\left|\psi_{E_f}(\phi(A_i(x+\|x\|\eta)),t) - \psi_{E_f}(\phi(A_ix),t)\right|}{\psi_{E_f}(\phi(A_ix),t)}\right]. \tag{44}$$
We denote the maximum and expected sensitivity respectively by
$$S_m(A_i;t) := \max_{x\in\mathcal{X}} S(x,A_i;t), \qquad S_e(A_i;t) := \mathbb{E}_{x\sim\mathcal{X}}\, S(x,A_i;t).$$

Now we use Algorithm 1 to investigate a method for compressing a layer $A_i$ so that the capacity properties are preserved.

Proposition B.5. Let a particular layer $A_i$ of the neural net be of dimension $h_1\times h_2$. Then Algorithm 1 generates an approximation $\hat A_i$ with $O\left(\log(h_1h_2)/\beta\alpha^2\right)$ parameters for which $\psi_{E_f(y)}(\phi(\hat A_ix),t)$ is proportional to $\psi_{E_f(y)}(\phi(A_ix),t)$ up to a relative error $\delta$, except with probability at most $\beta + S_m(A_i;t)/\delta$.

Proof. Using the fact that $\psi_{E_f(y)}(\phi(\hat A_ix),t) = \psi_{E_f(y)}(\phi(A_i(x+\|x\|\eta)),t)$ for an appropriate noise realization $\eta$, let $\mathcal{A}_\delta$ denote the event that
$$\frac{\left|\psi_{E_f(y)}(\phi(\hat A_ix),t) - \psi_{E_f(y)}(\phi(A_ix),t)\right|}{\psi_{E_f(y)}(\phi(A_ix),t)} = \frac{\left|\psi_{E_f(y)}(\phi(A_i(x+\|x\|\eta)),t) - \psi_{E_f(y)}(\phi(A_ix),t)\right|}{\psi_{E_f(y)}(\phi(A_ix),t)} \ge \delta.$$
For every fixed $x\in\mathcal{X}$, using (44), Markov's inequality immediately implies
$$\mathbb{P}[\mathcal{A}_\delta] \le \frac{S(x,A_i;t)}{\delta}.$$
Since Algorithm 1 yields controlled distortion, given error parameters $\alpha,\beta$ one gets $\hat A_i$, a stochastic approximation of $A_i$, such that
$$\mathbb{P}\left[\|\hat A_ix - A_ix\| \ge \alpha\|A_i\|_F\|x\|\right] \le \beta.$$
Here the reduced number of parameters of $\hat A_i$ is $O\left(\log(h_1h_2)/\beta\alpha^2\right)$. With that, we have
$$\mathbb{P}[\hat{\mathcal{A}}_\delta] = \mathbb{P}\left[\frac{\|\hat A_ix - A_ix\|}{\alpha\|A_i\|_F\|x\|} < 1,\ \hat{\mathcal{A}}_\delta\right] + \mathbb{P}\left[\frac{\|\hat A_ix - A_ix\|}{\alpha\|A_i\|_F\|x\|} \ge 1,\ \hat{\mathcal{A}}_\delta\right] \tag{48}$$
$$\le \mathbb{P}[\mathcal{A}_\delta] + \mathbb{P}\left[\frac{\|\hat A_ix - A_ix\|}{\alpha\|A_i\|_F\|x\|} \ge 1\right] \le \frac{S(x,A_i;t)}{\delta} + \beta.$$
This concludes the claim.

The above proposition may seem suboptimal and even somewhat of a tautology, but we include all the details because one way forward is now evidently clear. In particular, the step in (48) can be improved if we know that whenever the distance between two vectors $z$ and $w$ is bounded above, then $\left|\psi_{E_f}(z,t) - \psi_{E_f}(w,t)\right|$ is bounded above as well. In plain language, we would like to say the following: if two points are close, then the respective probabilities of Brownian particles starting from them and hitting $N_f$ are also close. This is too much to expect in general, but it can be accomplished when one places, in addition, certain nice assumptions on the decision boundary.
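As a computational aside, the sensitivity of Definition 4 admits a straightforward Monte-Carlo estimator via path sampling (this is made precise by Definition 5 and Proposition B.12 below). The following sketch is our own illustration: the toy `classifier` stands in for the map behind $\psi_{E_f}$, the Gaussian `noise_scale` plays the role of the distribution $\mathcal{N}$, and all names and parameter values are assumptions, not the paper's code.

```python
import numpy as np

def capacity_sensitivity(classifier, x, label, t, noise_scale,
                         m_noise=20, k_paths=50, n_steps=50, rng=None):
    """Monte-Carlo estimator of the capacity sensitivity S(x, A; t):
    sample noise vectors eta_j, Brownian paths from x + ||x|| * eta_j, and
    average the relative deviation of the hit indicators from psi(x, t)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = t / n_steps

    def hits(start):
        # 1 if a sampled Brownian path from `start` crosses the boundary by time t
        pos = start.copy()
        for _ in range(n_steps):
            pos += np.sqrt(dt) * rng.standard_normal(start.shape)
            if classifier(pos) != label:
                return 1.0
        return 0.0

    # Baseline heat imprint psi(x, t), itself estimated by sampling.
    psi = np.mean([hits(x) for _ in range(m_noise * k_paths)])
    if psi == 0.0:
        return 0.0  # no heat reaches x at this time scale
    total = 0.0
    for _ in range(m_noise):
        eta = noise_scale * rng.standard_normal(x.shape)
        start = x + np.linalg.norm(x) * eta
        total += sum(abs(hits(start) - psi) for _ in range(k_paths))
    return (total / (m_noise * k_paths)) / psi

# Toy setting (our assumption): flat boundary {p_0 > 1} in R^4.
clf = lambda p: int(p[0] > 1.0)
s = capacity_sensitivity(clf, np.array([0.0, 2.0, 0.0, 0.0]), label=0, t=0.5,
                         noise_scale=0.05, rng=np.random.default_rng(1))
```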

B.4 PROOF OF FIRST PART OF PROPOSITION 5.2 OF THE MAIN TEXT

We will break down the proof over three propositions, to illustrate the flow of ideas. The first is the case of the hyperplane, which we discussed to some extent above in our curvature analysis (see also Lemma 3.2 of the main text).

Proposition B.6. If the decision boundary $N_f$ is a hyperplane, then given $\beta,\epsilon$, one can find an $\alpha$ for which the compression scheme of Algorithm 1 gives a compression of a layer $A_i$ of dimension $h_1\times h_2$ to $\hat A_i$ with $O\left(\log(h_1h_2)/\beta\alpha^2\right)$ parameters such that
$$\mathbb{P}\left[\|A_ix - \hat A_ix\| \le \alpha\|A_i\|_F\|x\|\right] \ge 1-\beta,$$
and $\left|\psi_{E_f}(A_ix,t) - \psi_{E_f}(\hat A_ix,t)\right| \le \epsilon$ with probability at least $1-\beta$. Here $t = O\left(\mathrm{dist}(A_ix, N_f)^2\right)$. The choice of $\alpha$ is made explicit by (50) below.

Proof. Let $w,z\in\mathbb{R}^n$ be two points such that $\|w-z\|\le\delta$. It is clear that the maximum value of $\psi_{E_f}(w,t) - \psi_{E_f}(z,t)$ is given by the probability that a Brownian particle starting from a point $x\in\mathbb{R}^n$ strikes a "slab" of thickness $\delta$ at a distance $d-\delta$ from $x$ within time $t$ (a slab is a tubular neighborhood of a hyperplane). Without loss of generality, assume that the point $z$ is at a distance $d$ from the hyperplane $N_f$. Then
$$0 \le \psi_{E_f}(w,t) - \psi_{E_f}(z,t) \le 2\left[\Phi\left(-\frac{d-\delta}{\sqrt t}\right) - \Phi\left(-\frac{d}{\sqrt t}\right)\right],$$
which implies that
$$\frac{\psi_{E_f}(w,t) - \psi_{E_f}(z,t)}{\psi_{E_f}(z,t)} \le 2\left[\frac{\Phi\left(-\frac{d-\delta}{\sqrt t}\right)}{\Phi\left(-\frac{d}{\sqrt t}\right)} - 1\right]. \tag{49}$$
From the above calculation, we get that
$$\frac{\|A_ix - \hat A_ix\|}{\|A_i\|_F\|x\|} \le \alpha \implies \frac{\psi_{E_f}(A_ix,t) - \psi_{E_f}(\hat A_ix,t)}{\psi_{E_f}(A_ix,t)} \le 2\left[\frac{\Phi\left(-\frac{d-\delta}{\sqrt t}\right)}{\Phi\left(-\frac{d}{\sqrt t}\right)} - 1\right],$$
where $\delta = \alpha\|A_i\|_F\|x\|$. We wish to apply the above estimate in the regime $t = O(d^2)$. For the sake of specificity, let $t = c_nd^2$. Now, given $\epsilon$, from (49) one can choose $\alpha$ such that
$$\mathbb{P}\left[\frac{\|\hat A_ix - A_ix\|}{\alpha\|A_i\|_F\|x\|} \le 1,\ \mathcal{A}_\epsilon\right] = 0.$$
It suffices to choose $\alpha$ such that, when $t = c_nd^2$,
$$2\left[\frac{\Phi\left(-\frac{d-\delta}{\sqrt t}\right)}{\Phi\left(-\frac{d}{\sqrt t}\right)} - 1\right] = \epsilon, \tag{50}$$
where $\delta = \alpha\|A_i\|_F\|x\|$. Then $\mathbb{P}[\mathcal{A}_\epsilon] \le \beta$.

Remark B.7. In the above calculation, the nonlinearity $\phi$ can be introduced easily.
Clearly, by the compression properties of Algorithm 1, we have
$$\|\hat A_i(\phi x) - A_i(\phi x)\| \le \alpha\|A_i\|_F\|\phi x\| \le \alpha\lambda\|A_i\|_F\|x\|,$$
where $\lambda$ is the Lipschitz constant associated to the nonlinearity $\phi$. In particular, if $\phi$ is the ReLU, then $\lambda = 1$. This gives us that if $\|\hat A_ix - A_ix\| \le \alpha\|A_i\|_F\|x\|$, then
$$\frac{\|A_i(\phi x) - \hat A_i(\phi x)\|}{\|A_i\|_F\|x\|} \le \alpha\lambda \implies \frac{\psi_{E_f}(A_ix,t) - \psi_{E_f}(\hat A_ix,t)}{\psi_{E_f}(A_ix,t)} \le 2\left[\frac{\Phi\left(-\frac{d-\delta}{\sqrt t}\right)}{\Phi\left(-\frac{d}{\sqrt t}\right)} - 1\right],$$
where $\delta = \alpha\lambda\|A_i\|_F\|x\|$. We mention in passing that the above proposition gives a connection between our capacity sensitivity $S(x,A;t)$ and the noise sensitivity $\psi_N$ defined by Arora et al. (2018).

Now consider the case of a curved hypersurface $H$ (which is being thought of as the decision boundary $N_f$) which is "sandwiched" between two hyperplanes $H_1$ and $H_3$. Assume that the hypersurface is at a distance $d$ from the point $z$, and that the distance between $H_1$ and $H_3$ is $l$.

Proposition B.8. In the above setting, all the conclusions of Proposition B.6 apply to $H$.

Proof. We have that $\psi_{N_f}(z,t) - \psi_{N_f}(w,t)$ is less than or equal to the maximum of the quantities
$$\Phi\left(-\frac{d}{\sqrt t}\right) - \Phi\left(-\frac{d+\delta+l}{\sqrt t}\right),\qquad \Phi\left(-\frac{d}{\sqrt t}\right) - \Phi\left(-\frac{d-\delta+l}{\sqrt t}\right),$$
$$\Phi\left(-\frac{d+l}{\sqrt t}\right) - \Phi\left(-\frac{d+\delta}{\sqrt t}\right),\qquad \Phi\left(-\frac{d+l}{\sqrt t}\right) - \Phi\left(-\frac{d-\delta}{\sqrt t}\right).$$
Let $M(d,t)$ denote this maximum. As argued before, $\psi_{N_f}(z,t) \ge \Phi\left(-\frac{d+l}{\sqrt t}\right)$. That gives
$$\frac{\psi_{E_f}(z,t) - \psi_{E_f}(w,t)}{\psi_{E_f}(z,t)} \le \frac{M(d,t)}{\Phi\left(-\frac{d+l}{\sqrt t}\right)}.$$
The rest of the argument is similar to the proof of Proposition B.6, and we skip the details.

Before moving on to the case of controlled curvature, we need a technical lemma. We state it explicitly because it seems to us that it could potentially have other applications.

Lemma B.9. Let $p\in\mathbb{R}^n$, and consider a cuboid $Q\subset\mathbb{R}^n$ with side lengths $a_1,\dots,a_n$. Let $q\in Q$ be the unique point which attains $d = \|p-q\| = \mathrm{dist}(p,Q)$. Lastly, assume that the line segment $pq$ is perpendicular to the face of $Q$ on which $q$ lies. Then
$$\psi_Q(p,t) = 2^n\left[\Phi\left(-\frac{d}{\sqrt t}\right) - \Phi\left(-\frac{a_1+d}{\sqrt t}\right)\right]\prod_{j=2}^n\left[\Phi\left(\frac{a_j}{2\sqrt t}\right) - \Phi\left(-\frac{a_j}{2\sqrt t}\right)\right].$$

Proof.
The proof follows easily from the fact that in an $n$-dimensional Brownian motion all the coordinates execute standard 1-dimensional Brownian motions independently, and then by applying the reflection principle. The ideas are very similar to the proof of Lemma 3.2 of the main text.

As an immediate application of Lemma B.9, we now show that the nice properties of the decision boundaries mentioned in Propositions B.6 and B.8 above are also shared by hypersurfaces with controlled curvature.

Proposition B.10. Let $H$ be a hypersurface diffeomorphic to a hyperplane, of curvature $\kappa$ (in the sense of (26)) satisfying $r \le \kappa \le R$. Then the conclusion of Proposition B.6 applies to $H$.

Proof. Let $z$ be a point with $d := \mathrm{dist}(z,H)$, and let $w$ be another point such that $\|z-w\| = \delta$. Let $E$ denote the misclassification region defined by $H$. Then
$$\left|\psi_E(z,t) - \psi_E(w,t)\right| \le \psi_A(z,t),$$
where $A$ denotes the region "sandwiched" between $H$ and $H-\delta$. As before, we will ultimately use $t$ in the regime $O(d^2)$. Now, given $t$, start by considering a ball $B(z,\lambda_t)$, and let $A_{\lambda_t} := A\cap B(z,\lambda_t)$. Here $\lambda_t$ is chosen so that $\psi_{A_{\lambda_t}}(z,t)$ comes arbitrarily close to $\psi_A(z,t)$. We now cover $A_{\lambda_t}$ with $N$ cubes $Q_j$, $j = 1,\dots,N$, such that each cube $Q_j$ has side lengths comparable to $\delta$. Due to the controlled curvature, we know that the cover has controlled multiplicity and $N \sim_{r,R,\lambda_t} 1/\delta^{n-1}$. Since we know that
$$\psi_{A_{\lambda_t}}(z,t) \le \sum_{j=1}^N \psi_{Q_j}(z,t),$$
it suffices to prove that the right-hand side above is $O(\delta)$. Via Lemma B.9 above, it suffices to prove the following:
$$\int_{-a}^{a} e^{-x^2}\,dx = O(a).$$
Now, we employ the following known trick:
$$\left(\int_{-a}^{a} e^{-x^2}\,dx\right)^n \asymp \int_{S^{n-1}}\int_0^{a} e^{-r^2}r^{n-1}\,dr\,d\omega \asymp \int_0^{a^2} e^{-\rho}\rho^{n/2-1}\,d\rho = \gamma\!\left(\frac n2, a^2\right),$$
up to dimensional constants, where $\gamma(s,x)$ denotes the usual lower incomplete Gamma function. From well-known asymptotics, $\gamma(n/2, a^2) \sim 2a^n/n$ as $a\to 0$, so the left-hand side is $O(a^n)$ for small enough $a$, and hence the integral is $O(a)$.
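To make the choice of $\alpha$ in (50) of Proposition B.6 concrete, one can solve numerically for the admissible displacement $\delta$ (and hence $\alpha = \delta/(\|A\|_F\|x\|)$). The bisection sketch below is our illustration; the constant $c_n = 0.5$ is an arbitrary choice for the regime $t = O(d^2)$.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relative_heat_gap(d, delta, t):
    """The bound 2 * [Phi(-(d - delta)/sqrt(t)) / Phi(-d/sqrt(t)) - 1] of (49)
    on the relative change of the hitting probability when a point at
    distance d from a flat boundary is displaced by delta."""
    s = math.sqrt(t)
    return 2.0 * (Phi(-(d - delta) / s) / Phi(-d / s) - 1.0)

def solve_delta(d, t, eps, tol=1e-10):
    """Bisection for the largest displacement delta with gap <= eps, i.e. the
    condition (50); the gap is monotone increasing in delta on [0, d)."""
    lo, hi = 0.0, d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_heat_gap(d, mid, t) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

d = 1.0
t = 0.5 * d ** 2          # the regime t = c_n d^2 with an assumed c_n = 0.5
delta = solve_delta(d, t, eps=0.1)
```

The admissible $\delta$ shrinks quickly as the tolerance $\epsilon$ decreases, reflecting how sharply the heat imprint of a flat boundary varies with distance in this regime.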

B.5 COMPRESSION PARAMETERS: GENERAL CASE

Now we go for the full neural net compression, which is essentially an iterated version of Proposition B.5. Consider a neural net $A$ consisting of $m$ layers, and let $\hat A^j$ denote the neural net $A$ whose first $j$ layers have been compressed using the scheme in Algorithm 1 at each level. By way of notation, let $A_j$ denote the $j$th layer of the original neural net (assumed to be of dimension $h_j^1\times h_j^2$), and $\hat A_j$ the $j$th layer of the compressed neural net. Then we have the following:

Proposition B.11. Given $\varepsilon > 0$ and $m$ parameter pairs $(\alpha_j,\beta_j)$, we can find a compression $\hat A^m$ with $\sum_{j=1}^m O\left(\log(h_j^1h_j^2)/\beta_j\alpha_j^2\right)$ parameters and associated parameters $\rho_j$ such that
$$\left|\psi_{E_f}(Ax,t) - \psi_{E_f}(\hat A^mx,t)\right| \le \sum_{j=1}^m \rho_j < \varepsilon$$
with probability at least $\prod_{j=1}^m \tau_j$, where $\tau_j = \prod_{i=1}^j(1-\beta_i) - S(\tilde x_{j-1},A_j;t)/\delta_j$.

Proof. We see that
$$\left|\psi_{E_f}(Ax,t) - \psi_{E_f}(\hat A^mx,t)\right| \le \left|\psi_{E_f}(Ax,t) - \psi_{E_f}(\hat A^1x,t)\right| + \left|\psi_{E_f}(\hat A^1x,t) - \psi_{E_f}(\hat A^2x,t)\right| + \cdots + \left|\psi_{E_f}(\hat A^{m-1}x,t) - \psi_{E_f}(\hat A^mx,t)\right|.$$
We will be compressing one individual layer at a time. At the first layer, we start with the entry $x$ taken from the sample set. Algorithm 1 gives us a compression $\hat A_1$ that, for given $\alpha_1,\beta_1$, satisfies
$$\|A_1x - \hat A_1x\| \le \alpha_1\|A_1\|_F\|x\|$$
with probability at least $1-\beta_1$. Here the reduced number of parameters of $\hat A_1$ is $O\left(\log(h_1^1h_1^2)/\beta_1\alpha_1^2\right)$. As a result,
$$\left|\psi_{E_f}(Ax,t) - \psi_{E_f}(\hat A^1x,t)\right| \le \rho_1,$$
where in the general situation (that is, without any additional assumption on the decision boundary $N_f$) $\rho_1 = \psi_{E_f}(\phi(A_1x),t)\,\delta_1$ with probability at least $1 - S(x,A_1;t)/\delta_1 - \beta_1$ (this is via Proposition B.5, by an application of Markov's inequality). Now that the first layer has been compressed, the entry data at the second layer is the vector $\phi(\hat A_1x)$.
Once again, for given parameters $\alpha_2,\beta_2$, Algorithm 1 generates a compression $\hat A_2$ at the second layer which satisfies, with probability at least $1-\beta_2$,
$$\|A_2(\phi\hat A_1x) - \hat A_2(\phi\hat A_1x)\| \le \alpha_2\|A_2\|_F\|\phi\hat A_1x\| \le \lambda\alpha_2\|A_2\|_F\|\hat A_1x\|$$
(by the Lipschitz-ness of the nonlinearity). So, with probability at least $(1-\beta_2)(1-\beta_1)$, we have that
$$\|A_2(\phi\hat A_1x) - \hat A_2(\phi\hat A_1x)\| \le \lambda\alpha_2\|A_2\|_F\left(\|A_1x\| + \alpha_1\|A_1\|_F\|x\|\right) \le \lambda\alpha_2\|A_2\|_F\left(\|A_1\|_F\|x\| + \alpha_1\|A_1\|_F\|x\|\right) = \lambda\alpha_2(1+\alpha_1)\|A_2\|_F\|A_1\|_F\|x\|.$$
We then have
$$\left|\psi_{E_f}(\hat A^1x,t) - \psi_{E_f}(\hat A^2x,t)\right| \le \rho_2,$$
where in the general situation $\rho_2 = \psi_{E_f}(\phi(A_2\tilde x_1),t)\,\delta_2$ with probability at least $(1-\beta_1)(1-\beta_2) - S(\tilde x_1,A_2;t)/\delta_2$. Here $\tilde x_j$ denotes the output at the $j$th layer of the compressed net. It can be checked via induction that the above process, iterated $j$ times, gives that
$$\|A_j(\phi(\tilde x_{j-1})) - \hat A_j(\phi(\tilde x_{j-1}))\| \le \lambda^{j-1}\alpha_j\prod_{i=1}^{j-1}(1+\alpha_i)\prod_{i=1}^j\|A_i\|_F\,\|x\|$$
with probability at least $\prod_{i=1}^j(1-\beta_i)$. That implies that
$$\left|\psi_{E_f}(\hat A^{j-1}x,t) - \psi_{E_f}(\hat A^jx,t)\right| \le \rho_j,$$
where in the general situation $\rho_j = \psi_{E_f}(\phi(A_j\tilde x_{j-1}),t)\,\delta_j$ with probability at least $\tau_j = \prod_{i=1}^j(1-\beta_i) - S(\tilde x_{j-1},A_j;t)/\delta_j$. Finally, summing up, this implies that
$$\left|\psi_{E_f}(Ax,t) - \psi_{E_f}(\hat A^mx,t)\right| \le \sum_{j=1}^m \rho_j,$$
which concludes the proof.

As an alternative working definition of noise sensitivity, we define the following:

Definition 5. We set
$$S(x,A;t) := \mathbb{E}_{\gamma\in\mathcal{B},\,\eta\sim\mathcal{N}}\left[\frac{\left|\psi_{E_f,\gamma}(\phi(A(x+\|x\|\eta)),t) - \psi_{E_f}(\phi(Ax),t)\right|}{\psi_{E_f}(\phi(Ax),t)}\right],$$
where the expectation is over $\eta\sim\mathcal{N}$ and all Brownian paths $\gamma$ starting at the point $\phi(A(x+\|x\|\eta))$ and ending inside $E_f(y)$ within time $t$ (the latter set sits inside the path space starting at $\phi(A(x+\|x\|\eta))$ and endowed with the Wiener measure). The random variable $\psi_{E_f,\gamma}(\phi(A(x+\|x\|\eta)),t)$ is defined as 1 if the path $\gamma$ strikes $E_f$ within time $t$, and 0 if it does not.

From the point of view of ML computation, Definition 5 has a slight advantage over Definition 4: it is computationally more efficient in view of the following sampling scheme.

Proposition B.12. If $\eta_1,\dots,\eta_m$ denote $m$ sampled values of $\eta$ and $\gamma_{j1},\gamma_{j2},\dots,\gamma_{jk}$ denote $k$ sampled Brownian paths starting at $x+\|x\|\eta_j$, then
$$\bar X = \frac{1}{mk}\sum_{j=1}^m\sum_{l=1}^k X_{jl}, \qquad X_{jl} = \frac{\left|\psi_{E_f,\gamma_{jl}}(\phi(A(x+\|x\|\eta_j)),t) - \psi_{E_f}(\phi(Ax),t)\right|}{\psi_{E_f}(\phi(Ax),t)},$$
approximates $S(x,A;t)$ well with high probability.

Proof. Begin by sampling $m$ values $\eta_1,\dots,\eta_m$ of $\eta$ and $k$ Brownian paths $\gamma_{j1},\gamma_{j2},\dots,\gamma_{jk}$ starting from each such $x+\|x\|\eta_j$. Attached to each such selection is an independent random variable $X_{jl}\,\psi_{E_f}(\phi(Ax),t)$ which takes values in $[0,1]$. For each $j,l$, we have that
$$\mathbb{E}\left[X_{jl}\,\psi_{E_f}(\phi(Ax),t)\right] = S(x,A;t)\,\psi_{E_f}(\phi(Ax),t).$$
Let $\bar X$ denote the mean of all the random variables $X_{jl}$, $j=1,\dots,m$, $l=1,\dots,k$. Now we can bring in Hoeffding's version of the Chernoff concentration bounds (Proposition B.1), which gives us that
$$\mathbb{P}\left[\left|\bar X - S(x,A;t)\right| \ge \frac{\tau}{\psi_{E_f}(\phi(Ax),t)}\right] \le e^{-2\tau^2mk}.$$

C APPENDIX C: DATASETS, SAMPLING DETAILS, TRAINING DETAILS AND FURTHER EXPERIMENTS

C.1 TECHNICAL SETUP

The experimental section of the work was conducted mainly on a CUDA 10.2 GPU rack consisting of four NVIDIA TITAN V units; this includes the model training as well as the Brownian motion sampling and further statistics. The neural network framework of choice was PyTorch 1.5. We provide the training as well as the sampling code for our experiments.

C.2 DATASETS

We worked with the well-known MNIST and CIFAR-10 datasets. MNIST is a 784-dimensional dataset consisting of 60000 images of handwritten digits whose dimensions are (28, 28); 50000 images were used for training and 10000 for validation. CIFAR-10 is a collection of 60000 32-by-32 color images (i.e. a 3072-dimensional dataset) corresponding to 10 different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks; 50000 images were used for training and 10000 for validation. As pointed out in the main text, adversarially robust decision boundaries exhibit fundamental differences between the MNIST and the CIFAR-10 dataset. MNIST yields particularly simple robust boundaries stemming from its almost binary nature, as elaborated in Schmidt et al. (2017) and confirmed in Ford et al. (2019). CIFAR-10, on the other hand, is notoriously vulnerable to attacks, which is reflected in the quantities we measure. For our experiments this means that adversarial/noisy training flattens the surrounding boundary, i.e. saturates the isoperimetric bound, but the boundary nevertheless still exhibits spiky structure, as reflected in the measurements of the isocapacitory bounds. For MNIST, on the other hand, the approximately binary nature of the examples gives the decision boundary much less 'freedom', resulting in a less distinct quantitative signature. For some exploratory toy examples (cf. Fig. 1, Fig. 2, Fig. 3 in the main text) we generated a planar dataset that alternates along a circle of radius r = 5: for a given ray through the origin we generate several points on the ray at approximately distance r from the origin and assign them to class 0; then we rotate the ray by a small angle counter-clockwise, sample several points on the rotated ray, again at approximately distance r from the origin, and this time assign them to class 1.
Repeating this process we produce the mentioned 2-class dataset that alternates along the circle of radius r and consists of 1250 points.
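A generator for this toy dataset might look as follows; this is a sketch under our own assumptions (250 rays with 5 points each, matching the 1250-point count, and a hypothetical radial jitter of 0.15), not the authors' released code.

```python
import numpy as np

def alternating_circle_dataset(n_rays=250, pts_per_ray=5, radius=5.0,
                               radial_jitter=0.15, rng=None):
    """Planar 2-class dataset that alternates along a circle: consecutive
    rays through the origin carry points of alternating labels at distance
    approximately `radius` from the origin."""
    rng = np.random.default_rng() if rng is None else rng
    X, y = [], []
    for i in range(n_rays):
        angle = 2.0 * np.pi * i / n_rays
        direction = np.array([np.cos(angle), np.sin(angle)])
        for _ in range(pts_per_ray):
            r = radius + radial_jitter * rng.standard_normal()
            X.append(r * direction)
            y.append(i % 2)           # alternate the class with each ray
    return np.array(X), np.array(y)

X, y = alternating_circle_dataset()   # 1250 points in two alternating classes
```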

C.3 SAMPLING DETAILS

An evaluation of the isocapacitory saturation is obtained by sampling 10000 Brownian paths with 400 steps each. In light of the curse of dimensionality, this configuration seems adequate for our purposes: theoretically, by projecting Brownian motion along the normal directions of the decision boundary, one sees that estimating hitting probabilities is essentially a lower-dimensional problem, e.g. 1-dimensional if the decision boundary is a hyperplane; practically, our experiments were numerically stable w.r.t. resampling and sample batch size. Further, for each data point x the relative error volume µ(x, r) is computed by sampling 10000 points uniformly in B(x, r). To compare with isoperimetric bounds (Subsection 3.2), for each data point x we sample 1000 points normally distributed as N(x, r/√n), concentrated around x in the ball B(x, r), and apply PGD with 400 steps to obtain the distance to the decision boundary N (a setup similar to Ford et al. (2019)). As above, repeated runs reveal, on average, an acceptable numerical stability of the order of 10⁻⁴.
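For concreteness, the estimation of the relative error volume µ(x, r) by uniform sampling in B(x, r) can be sketched as follows. This is our illustration only: the toy classifier and sample sizes are placeholders, and uniform sampling in the ball uses the standard direction-times-radius construction.

```python
import numpy as np

def relative_error_volume(classifier, x, label, r, n_samples=10000, rng=None):
    """Estimate mu(x, r): the fraction of the ball B(x, r) occupied by the
    error set, via uniform sampling in the ball."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    # Uniform point in an n-ball: uniform direction, radius r * U^(1/n).
    g = rng.standard_normal((n_samples, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    radii = r * rng.random(n_samples) ** (1.0 / n)
    pts = x + radii[:, None] * g
    return float(np.mean([classifier(p) != label for p in pts]))

# Toy check (our assumption): error set = halfplane {p_0 > 1}, ball of radius 2.
clf = lambda p: int(p[0] > 1.0)
mu = relative_error_volume(clf, np.zeros(2), label=0, r=2.0,
                           rng=np.random.default_rng(0))
```

In this planar toy case µ is the area of a circular segment divided by the disk area, roughly 0.196, which the sampled estimate should match up to Monte-Carlo error.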

C.4 DEFENSE TRAINING: FGSM VS PGD

In the present work we are interested in how adversarial/noise defense training is reflected geometrically. To this end we study the application of two defense strategies: FGSM and PGD. Previous work (Ford et al. (2019)) indicates that FGSM-based training already leads to boundary flattening. However, in general it cannot be guaranteed that FGSM-based adversarial training will

Figure 13: The statistics obtained from the Residual Networks with 32, 44 and 56 layers on the CIFAR-10 dataset. For this experiment we considered Brownian particles with average displacement equal to the radius of a sphere with relative volume µ = 0.01, where µ is defined according to equation (2) in the main text. The considered quantities are (Left) the probability of a Brownian particle colliding with the decision boundary, (Center Left) the isocapacitory bound, i.e. the ratio of said probability to the relative volume µ, (Center Right) the radius of the obtained sphere, equal to the RMSD of the particle, and (Right) the saturation of the isoperimetric bound. We observe consistent behavior of the shown quantities for all three models. The trend of isoperimetric saturation (although not as concentrated as in the case of WRN and LeNet-5, Fig. 4) as well as the increase of distances r are present. Again the isocapacitory saturation does not appear to concentrate around the case of a flat decision boundary despite the overall increase in flatness: here both noisy and adversarial training seem to deliver a decrease in τ. In fact, the heat imprint of the ordinarily trained model exhibits a "flatter" behaviour in terms of τ.



Figure 1: Heating up a planar decision boundary of a 5-layer MLP over time. The amounts of radiated heat reflect the geometry of the decision boundary: size, density, curvature.

Figure 2: A planar 2-class dataset that alternates along a circle. (Left) A depiction of the planar circle-like dataset and the corresponding decision boundary of a 5-layer MLP. (Center) Brownian paths starting at a data point x and killed upon impacting the decision boundary/opposite class. (Right) Set-up of the local Brownian motion analysis with notation on radius r, dimension n and Brownian runtime t.

Figure 4: Results for a Wide-ResNet 28-10 and a LeNet-5 trained on CIFAR-10 and MNIST, respectively. Different boxplots correspond to different training strategies: ordinary, adversarial, with noise or with a Brownian augmentation. Data is collected over 1000 test data points, where each radius r is selected so that the relative error volume µ equals 1%. Left-to-right, the columns correspond to the isocapacitory saturation τ, the radius r realizing µ = 1% and the isoperimetric saturation. Finally, red dashed horizontal lines indicate the corresponding values for flat decision boundaries.

Figure 7: The isocapacitory saturation τ of a flat error set. Given a point x, the computation takes place in B(x, r) with r = 1. The distance to the flat decision hyperplane is given on the x-axis, while the y-axis gives τ . Curve labeling indicates the respective dimension. There appears to be a threshold dividing between the regimes τ ≈ 2 and τ → ∞.

Figure 8: Further model cases and plots of the isocapacitory saturation τ . (Left) Isocapacitory saturation of cone in terms of the opening angle (radians). (Right) Isocapacitory saturation of wedge in terms of the opening angle (radians). Curve labels indicate the respective dimension. Again one observes concentration of measure as the volume of the cone decreases to 0 exponentially fast in terms of the dimension n: this is why we plot the opening angle around π in this case. Furthermore, cones and wedges with an opening angle of almost π behave like hyperplanes in terms of saturation.

Figure 9: Cylindrical "spike" of height h and radius ρ inside the ball B(x, r).

Figure 11: Moving the piece p v near the tip of the obstacle and reattaching it far away as p ṽ reduces the hitting probability, but preserves volume.

Figure 12: Covering by cuboids of side length δ.

Figure 14: Statistics for a convolutional neural network with four convolutional and two linear layers applied to the MNIST dataset. This particular convolutional model shows that not every architecture/training/dataset instance displays the distinguished trend of increasing the isoperimetric saturation; however, even in this scenario the isoperimetric saturation is quite sharp. Similar to the other experiments above, the isocapacitory saturation τ, on the other hand, does not concentrate to such an extent.

Figure 15: Typical examples of the CIFAR-10 dataset used to train the models. From left to right, the clean image, a PGD adversarial example, a Gaussian perturbation (σ² = 0.4) and the terminal point of a Brownian random walk (undirected attack) immediately after colliding with the decision boundary are shown. The comparison between the PGD adversarial example and the rightmost picture emphasizes the degree to which spikes in the decision boundary deviate from the average distance between the boundary and the clean example.

Figure 16: Evaluation of the accuracies of the LeNet-5 (MNIST) models under a range of attacks. While all models exhibit nearly identical accuracy on clean data, the adversarially trained models exhibit more robustness under the various attacks. Across all measures, the models trained on randomly chosen adversarial examples perform worst.

this is achieved via certain local Rademacher complexity and "peeling" techniques. However, we choose not to pursue these bounds in the present work and adopt the perspective of Arora et al. (2018) that g, being almost equivalent to f, provides a reasonable indicator of generalization capabilities.


We are left to indicate the proof of the second part of Proposition 5.2 from the main text. It follows in a straightforward way from the proof of Proposition B.11, using at every step the bounds from Propositions B.6, B.8 and B.10 instead of the bound from Proposition B.5.

Table 1: Summary of validation accuracies for Wide-ResNets 28-10 for various training methods on the CIFAR-10 data set.

Wide-ResNet 28-10 trained on noise (σ² = 0.1): 91.22 %
Wide-ResNet 28-10 trained on noise (σ² = 0.4): 86.07 %
Wide-ResNet 28-10 adversarially trained (FGSM): 87.10 %
Wide-ResNet 28-10 adversarially trained (PGD): 85.05 %

Table 2: Summary of validation accuracies for the ResNets with 32, 44 and 56 layers for various training methods on the CIFAR-10 data set.

Table 3: Summary of validation accuracies for LeNet-5 and a convolutional neural network with four convolutional and two linear layers for various training methods on the clean MNIST data set.

ACKNOWLEDGMENTS

We would like to thank our anonymous reviewers, whose advice helped improve the quality of the presentation. We are indebted to Prof. Christian Bauckhage for his constant encouragement, support and fruitful discussions. We also sincerely thank Benjamin Wulff for maintaining the outstanding computation environment at Fraunhofer IAIS; his support and coffee conversations played an essential role for our empirical analysis. In part, this work was supported by the Competence Center for Machine Learning Rhine-Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038B). We gratefully acknowledge this support.


Published as a conference paper at ICLR 2021

provide appropriate levels of robustness (against strong adversaries, e.g. iterative attacks); recently, Wong et al. (2020) showed that only with certain proper design choices (e.g. a random start) does FGSM-based training become robust. This indicates that, if not implemented carefully, FGSM-based and stronger defense trainings (e.g. the PGD-based adversarial training of Madry et al. (2018)) can lead to very different geometries of the decision boundary. Therefore, we opt for evaluating the FGSM-based as well as the PGD-based defense in an attempt to reveal the relationship between the decision boundaries of a truly robust model and the isocapacitory saturation values. Details are given in Fig. 4 and the accompanying analysis.

C.5 TRAINING DETAILS

Training on the CIFAR-10 dataset. All training procedures used standard data augmentation techniques such as flips, horizontal shifts and crops, and inputs were normalized with respect to the data mean and standard deviation. The training of the Wide-ResNets followed the framework provided by Cubuk et al. (2018) with weight decay 5e-4, batch size 128 and a decrease of the initial learning rate of 0.1 by a factor of 0.2 at epochs 60, 120 and 160. The ResNets were trained with weight decay 1e-4 and a stepwise decrease of the initial learning rate of 0.1 by a factor of 0.1 at epochs 100 and 150.

Training on the MNIST dataset. We consider two models trained with various data augmentation techniques. We trained a LeNet-5 architecture (LeCun et al. (1998)) over 50 epochs with learning rate 1e-3, weight decay 5e-4 and batch size 64, optimizing the cross-entropy loss using root mean square propagation. The same procedure was implemented to train a basic convolutional neural network consisting of four convolutional and two subsequent linear layers. While LeNet-5 also uses convolutional layers, it additionally applies max-pooling after each convolutional layer.

Training on the planar toy dataset. We experimented with several 5-layer MLP models (each layer containing 20, 40, 70 or 100 hidden units) on the mentioned planar dataset concentrated along the circle of radius 5 centered at the origin. Training followed a straightforward ADAM optimization procedure with a learning rate of 1.0e-5 and a batch size of 128.

C.6 DATA MANIPULATIONS DURING TRAINING

To evaluate how various training methods affect the geometric properties of the decision boundary, we conduct three major types of training for all models: training on clean data; training on data with added Gaussian perturbations with variance σ² = 0.4; and, finally, training with additional adversarial defense methods, where for each training example we add an adversarially chosen example to the dataset using the fast gradient sign method (FGSM). For LeNet-5 we also considered the effect of adversarial training where the additional example is the result of a Brownian random walk terminated upon collision with the decision boundary. See Fig. 15 for a visual example of perturbations/attacks with the described methods. The resulting accuracies evaluated on the clean datasets for all trained models are shown in Tables 1, 2, 3. As an additional benchmark of the trained models, we evaluated the robustness of the LeNet-5 architectures. Figure 16 exhibits the trained models' accuracies on clean data, PGD attacks with ε = 0.5 and ε = 1.0, Gaussian perturbations, and fog with severity 4 according to the MNIST-C dataset (Mu & Gilmer (2019)).

C.7 ISOCAPACITORY AND ISOPERIMETRIC RESULTS

Here we summarize the observations indicated by the obtained geometric data. Besides the results presented in the main text for the models Wide-ResNet 28-10 and LeNet-5 (Fig. 4), we also considered the geometric properties of the aforementioned Residual Networks (CIFAR-10) (see Fig. 13).

