THE IMPLICIT BIAS OF MINIMA STABILITY IN MULTIVARIATE SHALLOW RELU NETWORKS

Abstract

We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors) whose second derivative has a bounded weighted L1 norm. Notably, the bound gets smaller as the step size increases, implying that training with a large step size leads to 'smoother' predictors. Here we generalize this result to the multivariate case, showing that a similar result applies to the Laplacian of the predictor. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size, whereas the same function can be realized as a stable two hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense) then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent.

1. INTRODUCTION

Neural networks (NNs) have demonstrated phenomenal performance in a wide array of fields, from computer vision and speech processing to medical sciences. Modern networks are typically highly overparameterized. In such settings, the training loss usually has multiple global minima, which correspond to models that perfectly fit the training data. Some of those models are clearly sub-optimal in terms of generalization. Yet, the training process seems to consistently avoid those bad global minima, and somehow steers the model towards global minima that generalize well. A long line of works attributed this behavior to "implicit biases" of the training algorithms, e.g., (Zhang et al., 2017; Gunasekar et al., 2017; Soudry et al., 2018; Arora et al., 2019). Recently, it has been recognized that a dominant factor affecting the implicit bias of gradient descent (GD) and stochastic gradient descent (SGD) is associated with dynamical stability. Roughly speaking, the dynamical stability of a minimum point refers to the ability of the optimizer to stably converge to that point. Particular research efforts have been devoted to understanding linear stability, namely the dynamical stability of the optimizer's linearized dynamics around the minimum (Wu et al., 2018; Nar & Sastry, 2018; Mulayoff et al., 2021; Ma & Ying, 2021). For GD and SGD, it is well known that a minimum is linearly stable if the loss landscape is sufficiently flat w.r.t. the step size η. Concretely, a necessary condition for a minimum to be linearly stable for GD and SGD is that the top eigenvalue of the Hessian at that minimum point be smaller than 2/η (see Sec. 2). Although this condition only characterizes the linearized dynamics, it has been empirically shown to hold in real-world neural-network training (Cohen et al., 2020; Gilmer et al., 2022).
The linear stability condition turns out to have a strong effect on the nature of the network that is obtained upon convergence, both in terms of the end-to-end predictor function (Mulayoff et al., 2021), and in terms of the way this function is implemented by the network (Mulayoff & Michaeli, 2020). Mulayoff et al. (2021) studied how linear stability affects a single hidden-layer univariate ReLU network, when trained with the quadratic loss. They showed that in this setting, stable solutions of SGD with step size η correspond to functions f satisfying ∫_R |f''(x)| g(x) dx ≤ 1/η − 1/2, (1) where f denotes the network's input-output function, and g is a weight function that depends only on the training data. This result implies that for univariate shallow ReLU networks, SGD is biased towards 'smooth' solutions. Moreover, the larger the step size η, the smoother the solution becomes. In this paper, we study the stable solutions of single hidden-layer ReLU networks with multidimensional inputs, trained using SGD and the quadratic loss. Particularly, in Sec. 3 we generalize the result of Mulayoff et al. (2021) to the multivariate setting. As it turns out, the natural extension of (1) involves the Radon transform of the Laplacian of the predictor function, ∆f (see Thm. 1). However, we show this result can also be interpreted in primal space as ∫_{R^d} |∆f(x)| ρ(x) dx ≤ 1/η − 1/2, where ρ is some weighting function. Thus, stable solutions of SGD in the multivariate case also correspond to smooth predictors (i.e., functions whose Laplacian has a small weighted L1 norm). The larger the step size, the smoother the function becomes. Figure 1 illustrates this phenomenon. Additionally, we study the approximation power of single hidden-layer ReLU networks corresponding to stable minima. It is well known that shallow ReLU networks can approximate any continuous function over a compact set (Pinkus, 1999).
However, this does not imply that SGD can stably converge to such approximations. If there exist functions whose approximations are all unstable, then universal approximation may be of limited practical interest. In Sec. 4 we prove that every convergent sequence of stable networks has a limit function that also satisfies the stability condition (Thm. 1). Building on this, we prove a depth separation result. Specifically, we show that there exists a function that does not satisfy the stability condition for any positive step size. Namely, it cannot be stably approximated by a single hidden-layer ReLU network trained with a non-vanishing step size. Yet, the same function can be realized as a two hidden-layer ReLU network corresponding to a stable minimum. Moreover, in Sec. 5 we show that if a function is sufficiently smooth (in a Sobolev sense) then it can be approximated arbitrarily well using single hidden-layer ReLU networks that correspond to stable solutions of GD. Finally, in Secs. 3.3 and 6 we demonstrate our results. Particularly, we illustrate how our characterization of stable minima (Thm. 1) can be used to predict certain properties of the solution. For example, for certain isotropic data (e.g., Gaussian), we show that a large step size tends to increase the biases of all neurons. We also demonstrate on the MNIST dataset the tightness of our stability bound, and show that it predicts well the dependence of stability and generalization performance on the step size.

2. BACKGROUND: MINIMA STABILITY OF SGD

In this section we give a brief survey of minima stability. Consider the problem of minimizing an empirical loss using SGD. We are interested in the typical regime of overparameterized models. In this setting, there exist multiple global minimizers of the loss. Yet, SGD cannot stably converge to just any minimum. The stability of a minimum is associated with the dynamics of SGD in its vicinity.
Specifically, a minimum is said to be stable if, once SGD arrives near it, it stays in its vicinity. If SGD repels from the minimum, then we say that it is unstable. Formally, let ℓ_j : R^d → R be differentiable almost everywhere for all j ∈ [n]. Here we consider a loss function L and its stochastic analogue, L(θ) = (1/n) Σ_{j=1}^n ℓ_j(θ) and L̂_t(θ) = (1/B) Σ_{j∈B_t} ℓ_j(θ), where B_t is a batch of size B sampled at iteration t. We assume that the batches {B_t} are drawn uniformly from the dataset, independently across iterations. SGD's update rule is given by θ_{t+1} = θ_t − η∇L̂_t(θ_t), (4) where η is the step size. Analyzing the full dynamics of this system is intractable in most cases. Therefore, several works studied the behavior of this system near minima using linearized dynamics (Wu et al., 2018; Ma & Ying, 2021; Nar & Sastry, 2018; Mulayoff et al., 2021), which is a common practice for characterizing the stability of nonlinear systems. Definition 1 (Linear stability). Let θ* be a twice differentiable minimum of L. Consider the linearized stochastic dynamical system θ_{t+1} = θ_t − η(∇L̂_t(θ*) + ∇²L̂_t(θ*)(θ_t − θ*)). (5) Then θ* is ε linearly stable if for any θ_0 in the ε-ball B_ε(θ*), we have lim sup_{t→∞} E[∥θ_t − θ*∥] ≤ ε. Namely, a minimum is ε linearly stable if once θ_t enters an ε-ball around the minimum, it ends up at a distance no greater than ε from it in expectation. Under mild conditions, any stable minimum of the nonlinear system is also linearly stable (Vidyasagar, 2002, p. 268). We have the following condition. Lemma 1 (Necessary condition for linear stability (Mulayoff et al., 2021, Lemma 1)). Consider SGD with step size η, where batches are drawn uniformly from the training set, independently across iterations. If θ* is an ε linearly stable minimum of L, then λ_max(∇²L(θ*)) ≤ 2/η. (6) This condition states that stable minima of SGD are flat w.r.t. the step size.
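The 2/η threshold of Lemma 1 is easy to verify on the linearized dynamics themselves. Below is a minimal numpy sketch (ours, not from the paper) that runs GD on a toy quadratic loss with a known Hessian, showing contraction below the threshold and divergence above it:

```python
import numpy as np

def linearized_gd(H, theta0, eta, steps=200):
    """Run GD on the quadratic 0.5 * theta^T H theta (minimum at theta = 0).
    Iterates contract iff eta * lambda_max(H) < 2."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - eta * (H @ theta)
    return theta

H = np.diag([1.0, 3.0])          # toy Hessian with lambda_max = 3
theta0 = np.array([1.0, 1.0])

stable = linearized_gd(H, theta0, eta=0.5)    # eta < 2/3 = 2/lambda_max: stable
unstable = linearized_gd(H, theta0, eta=0.7)  # eta > 2/3: unstable

print(np.linalg.norm(stable))    # shrinks toward 0
print(np.linalg.norm(unstable))  # blows up
```

For SGD, the same threshold governs the expected linearized iterates, which is the reduction used in the proof of Lemma 1 (see App. A).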
Although this result was proved for the linearized dynamics, it has been observed to hold also in practice, where the full nonlinear dynamics apply. In particular, extensive empirical evidence from real-world neural-network training (Cohen et al., 2020; Gilmer et al., 2022) indicates that GD and SGD converge only to linearly stable minima, i.e., minima satisfying (6). More on dynamical stability and its interaction with common practices (e.g., learning rate decay, the absence of ε and B from the result of Lemma 1, etc.) can be found in App. A.

3. LARGE STEP SIZE BIASES TO SMOOTH FUNCTIONS

Consider the set of multivariate functions over R^d that can be implemented by a single hidden-layer ReLU network with k neurons, F_k ≜ { f : R^d → R | f(x) = Σ_{i=1}^k w_i^{(2)} σ(x⊤w_i^{(1)} + b_i^{(1)}) + b^{(2)} }, where σ(·) denotes the ReLU activation function. Each f ∈ F_k is a piecewise linear function with at most k knots. Given some training set {x_j, y_j}_{j=1}^n, we are interested in functions that globally minimize the quadratic loss L(f) ≜ (1/(2n)) Σ_{j=1}^n (f(x_j) − y_j)². (8) Definition 2 (Solution). A function f ∈ F_k is a 'solution' if L(f) = 0, i.e., f(x_j) = y_j ∀j ∈ [n]. We focus on the overparameterized regime (kd > n), in which there exist multiple solutions. We want to study the properties of solutions which correspond to stable minima of SGD. However, a key challenge is that any solution f ∈ F_k typically has infinitely many different parameterizations. In other words, there are various parameter vectors θ ≜ [w_1^{(1)⊤} ··· w_k^{(1)⊤} b^{(1)⊤} w^{(2)⊤} b^{(2)}]⊤ ∈ R^{(d+2)k+1} that can implement the same function f. Different parameterizations correspond to different minima, which may have different Hessian eigenvalues. Therefore, for a given step size η, some parameterizations of f may be stable while others may not. Thus, to determine whether SGD can stably converge to a solution f, we need to check whether there exists some stable minimum θ which corresponds to a parametrization of f. We therefore use the following definition. Definition 3 (Stable solution). A solution f ∈ F_k is said to be stable for step size η if there exists a minimum θ* of the loss that corresponds to f, where θ* is linearly stable for SGD with step size η. The next theorem characterizes stable solutions using the Radon transform R (see App. C) and the Laplace operator ∆. Particularly, we use the inverse of the dual Radon transform, (R*)⁻¹, and interpret ∆f in the weak sense, i.e., as a sum of weighted Dirac delta functions (see App. D).
Theorem 1 (Properties of stable solutions). Let f be a linearly stable solution for SGD with step size η. Assume that the knots of f do not coincide with any training point. Then ∥f∥_{R,g} ≤ 1/η − 1/2, where ∥·∥_{R,g} is the stability norm, defined as ∥f∥_{R,g} ≜ ∫_{S^{d−1}×R} |(R*)⁻¹∆f(v,b)| ḡ(v,b) ds(v) db, with ḡ(v,b) ≜ min{g(v,b), g(−v,−b)} a non-negative weighting function, and g given by g(v,b) ≜ P²(X⊤v > b) E[X⊤v − b | X⊤v > b] (∥E[X | X⊤v > b]∥² + 1). Here X is a random vector drawn from the dataset's distribution (i.e., sampled uniformly from {x_j}). This theorem, whose proof is provided in App. E, shows that the step size constrains the stability norm of the solution. Notably, the constraint becomes stricter as the step size increases. Before interpreting this result, let us note that although the bound depends only on the step size, other hyper-parameters (e.g., batch size, initialization) may potentially improve it. Yet, as we discuss in App. A, the effect of other hyper-parameters seems secondary in practical settings. The implications of Thm. 1 can be understood both in primal space and in Radon space. In the following, we discuss both interpretations and give examples.
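Since g in Thm. 1 is defined through expectations over the empirical data distribution, it can be estimated directly from samples. The sketch below is our own illustration: it assumes the product form g(v,b) = P²(X⊤v > b) · E[X⊤v − b | X⊤v > b] · (∥E[X | X⊤v > b]∥² + 1) as parsed here, and all function names are ours:

```python
import numpy as np

def g_weight(X, v, b):
    """Empirical estimate of g(v, b) for a dataset X of shape (n, d),
    assuming the parsed form
    g(v,b) = P^2(X^T v > b) * E[X^T v - b | X^T v > b]
             * (||E[X | X^T v > b]||^2 + 1)."""
    proj = X @ v
    mask = proj > b
    p = mask.mean()
    if p == 0:
        return 0.0  # b is beyond the support of the projected data
    cond_margin = (proj[mask] - b).mean()      # E[X^T v - b | X^T v > b]
    cond_center = X[mask].mean(axis=0)         # E[X | X^T v > b]
    return p**2 * cond_margin * (cond_center @ cond_center + 1.0)

def g_bar(X, v, b):
    """Symmetrized weight min{g(v,b), g(-v,-b)} from Theorem 1."""
    return min(g_weight(X, v, b), g_weight(X, -v, -b))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
v = np.array([1.0, 0.0])
print(g_bar(X, v, 0.0))  # non-negative, finite
```

Note how the estimate vanishes for b outside the support of {x_j⊤v}, matching the finite-support property of g(v,·) discussed in Sec. 3.2.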

3.1. PRIMAL SPACE INTERPRETATION

Theorem 1 is stated in Radon space, which may be difficult to interpret. However, in some cases it can also be interpreted in primal space, by deriving an alternative form for the stability norm ∥·∥_{R,g}. Specifically, in App. G we show that if g is piecewise continuous and L1-integrable, then for all f ∈ F_k and ρ = R⁻¹g we have ∥f∥_{R,g} = ∫_{R^d} |∆f(x)| ρ(x) dx. (13) In this representation of the stability norm, ρ is not necessarily non-negative. Nevertheless, all its hyper-plane integrals are non-negative, since Rρ = g ≥ 0. Thus, the stability norm can be interpreted as a non-negative linear combination of hyper-plane integrals of ρ along the knots of f. This is visualized in Fig. 2. Hence, Thm. 1 combined with (13) implies that the larger the step size η, the smoother the solution becomes.

3.2. RADON SPACE INTERPRETATION

Another interesting interpretation of Thm. 1 can be derived in Radon space. First, let us examine how the weight function g(v,b) behaves as a function of b. For every fixed v, the function g(v,·) has finite support, [min_j{x_j⊤v}, max_j{x_j⊤v}]. Moreover, g(v,·) typically has most of its mass concentrated around the center of the distribution of the projected data points {x_j⊤v}, and it decays towards the endpoints (see e.g., Fig. 3). Next, let us interpret how the term (R*)⁻¹∆f behaves. For a single hidden-layer ReLU network, (R*)⁻¹∆f is a sum of Dirac deltas. Specifically, as shown in (Ongie et al., 2020), if f is a function of the form f(x) = Σ_{i=1}^k a_i σ(v_i⊤x − b_i) + c with ∥v_i∥₂ = 1 for all i ∈ [k], then (see App. F.3) (R*)⁻¹∆f = Σ_{i=1}^k a_i δ_{(v_i,b_i)}, where ∆f is the (distributional) Laplacian of f, and δ_{(v,b)} denotes a Dirac delta centered at (v,b) ∈ S^{d−1} × R. We can thus define a parameter-space representation of the stability norm as (see App. F.3) S_θ ≜ Σ_{i=1}^k |a_i| ḡ(v_i, b_i). Generally, this parametric representation of the stability norm satisfies ∥f∥_{R,g} ≤ S_θ, where equality holds whenever the ReLU knots of the representation do not coincide (i.e., there is one Dirac delta for each ReLU unit). Yet, this parametric view of the stability norm also obeys (see App. F.3) S_θ ≤ 1/η − 1/2. (15) Hence, larger step sizes η push S_θ to be smaller, and from (15) we see that the |a_i| will tend to be small. Also, since ḡ(v,·) typically decays towards the boundary of its support, this pushes the neurons' biases b_i away from the center of the distribution. The resulting effect is that the predictor function f becomes flatter, especially near the center of the distribution. This is illustrated in Figs. 1 and 4(b).
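The parametric quantity S_θ can be computed directly from a network's weights, after rewriting each neuron in the normalized form a_i σ(v_i⊤x − b_i) with ∥v_i∥ = 1 via the positive homogeneity of the ReLU. A minimal sketch (our code; the weight function is passed in as a callable, so the formula for ḡ is not hard-coded):

```python
import numpy as np

def stability_norm_param(W1, b1, w2, g_bar):
    """Parameter-space stability norm S_theta = sum_i |a_i| * g_bar(v_i, b_i).
    Each neuron w2_i * relu(x^T W1_i + b1_i) is rewritten in the normalized
    form a_i * relu(v_i^T x - beta_i) with ||v_i|| = 1, using positive
    homogeneity: relu(c z) = c * relu(z) for c > 0."""
    S = 0.0
    for w1_i, b1_i, w2_i in zip(W1, b1, w2):
        c = np.linalg.norm(w1_i)
        if c == 0:
            continue  # degenerate neuron: contributes no knot hyperplane
        v_i = w1_i / c
        beta_i = -b1_i / c   # sign flip: sigma(x^T w + b) = c * sigma(v^T x - beta)
        a_i = w2_i * c
        S += abs(a_i) * g_bar(v_i, beta_i)
    return S

# toy example with the constant weight g_bar = 1 (the R-norm case)
W1 = np.array([[3.0, 4.0], [0.0, 2.0]])
b1 = np.array([1.0, -2.0])
w2 = np.array([0.5, -1.0])
print(stability_norm_param(W1, b1, w2, lambda v, b: 1.0))  # 2.5 + 2.0 = 4.5
```

With ḡ ≡ 1 the quantity reduces to Σ_i |a_i|, i.e., the parametric form of the R-norm of Ongie et al. (2020) discussed in Sec. 7.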

3.3. EXAMPLES

Earlier we introduced two interpretations of the stability norm ∥·∥_{R,g}: one in primal space, which uses the weight function ρ, and one in Radon space, which uses the weight function g. In this section, we compute g and ρ for two toy examples for which (13) holds. Example 1: Two data points in R². Assume the dataset contains two points: x_1 = (1, 0) and x_2 = (−1, 0). In this case we can analytically calculate g and ρ (see App. H.1). Figure 3 depicts these functions. Here, ρ has singularities at x_1, x_2 and at the origin. Yet, despite these singularities, all line integrals of ρ are finite, and thus the expression ∫_{R^d} |∆f(x)| ρ(x) dx is well-defined. Moreover, while ρ takes negative values, all its line integrals are positive. Example 2: Isotropic distribution. Suppose the data is isotropically distributed, i.e., P(X⊤v > b) does not depend on the direction of v. Then g is independent of v, which implies that ρ = R⁻¹g is a radial function. In App. H.2 we give an analytic expression for g for any isotropic distribution. In the special case of X ∼ N(0, I), g(v,·) decays monotonically with b, and thus, as discussed in Sec. 3.2, large step sizes will tend to increase the biases of all neurons. Additionally, we show in App. H.2 that for 2D data, ρ is positive and strictly decreasing in ∥x∥, and it satisfies the asymptotics ρ(x) = O(log(∥x∥)) as ∥x∥ → 0, and ρ(x) = O(∥x∥⁻¹) as ∥x∥ → ∞. Figure 3 visualizes g and ρ for two-dimensional Gaussian data.
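The defining property of Example 2, that P(X⊤v > b) is direction-independent for isotropic data, is easy to confirm by Monte Carlo on Gaussian samples (illustration only; the directions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200_000, 3))  # isotropic data: N(0, I) in R^3

def tail_prob(X, v, b):
    """Empirical P(X^T v > b) for a unit direction v."""
    v = v / np.linalg.norm(v)
    return (X @ v > b).mean()

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([1.0, -2.0, 0.5])
p1, p2 = tail_prob(X, v1, 1.0), tail_prob(X, v2, 1.0)
print(p1, p2)  # both approximately P(N(0,1) > 1) ~ 0.1587
```

The same holds for the conditional expectations entering g, so for isotropic data the whole weight g(v,·) is indeed a function of b alone.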

4. STABILITY LEADS TO DEPTH SEPARATION

Single hidden-layer neural networks are universal approximators, i.e., they can approximate arbitrarily well any continuous function over compact sets (Pinkus, 1999). However, some of these approximations may correspond to unstable minima that are virtually unreachable when training via SGD. To understand the effective approximation power of neural networks, we need to identify the class of functions that have stable approximations. We have the following (see proof in App. I.1). Proposition 1. Let X be the interior of the convex hull of the training points, and f : X → R be any function. Suppose there exists a sequence of single hidden-layer ReLU networks {f_k} with bounded stability norm that converges to f in L1 over X. Then ∥f∥_{R,g} is finite, and lim_{k→∞} ∥f_k∥_{R,g} = ∥f∥_{R,g}. Let {f_k} be a convergent sequence of stable solutions with a growing number of knots, i.e., f_k ∈ F_k and ∥f_k∥_{R,g} ≤ 1/η − 1/2 for all k (see Thm. 1). Then, by the proposition above, the limit function f also satisfies this inequality. Therefore, the effective class of functions that can be approximated arbitrarily well by single hidden-layer ReLU networks includes only continuous functions f that satisfy the stability condition ∥f∥_{R,g} ≤ 1/η − 1/2. As the step size decreases, more functions satisfy this condition, suggesting that more functions can be stably approximated by single hidden-layer ReLU networks. Surprisingly, there exists at least one continuous function p that has ∥p∥_{R,g} = ∞ and therefore does not satisfy the stability condition for any positive step size (see proof in App. I.2). Therefore, by Prop. 1 and Thm. 1, this function cannot be approximated arbitrarily well by single hidden-layer ReLU networks trained with a non-vanishing step size. Proposition 2. Assume the input dimension d ≥ 2, and let p(x) = σ(1 − ∥x∥₁). Suppose the support of p is contained in the interior of the convex hull of the training points. Then ∥p∥_{R,g} = ∞.
Intriguingly, this function does have an implementation as a finite-width two hidden-layer network, p(x) = σ(1 − Σ_{i=1}^d (σ(x_i) + σ(−x_i))), which is a stable solution for a fixed step size. Indeed, in App. I.3 we demonstrate that for an appropriate choice of η, GD is able to converge to this implementation. Thus, we have a depth separation result: the function p cannot be approximated by stable minima of one hidden-layer networks trained with a non-vanishing step size, yet with two hidden layers, GD can converge to this function with a fixed step size.
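The two hidden-layer implementation of p can be written out explicitly: the first layer computes |x_i| = σ(x_i) + σ(−x_i), and the second applies the outer ReLU. A quick numerical check (our code) that this composition reproduces σ(1 − ∥x∥₁):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pyramid(x):
    """Two hidden-layer ReLU implementation of p(x) = relu(1 - ||x||_1):
    first hidden layer (2d units) computes |x_i| = relu(x_i) + relu(-x_i),
    second hidden layer (1 unit) applies the outer ReLU."""
    abs_coords = relu(x) + relu(-x)
    return relu(1.0 - abs_coords.sum())

x = np.array([0.2, -0.3])
print(pyramid(x))  # = relu(1 - 0.5) = 0.5
print(pyramid(np.array([2.0, 0.0])))  # = 0.0 outside the unit l1-ball
```

This makes the depth separation concrete: the same piecewise linear "pyramid" that requires infinitely many stable shallow neurons is realized exactly by 2d + 1 neurons arranged in two hidden layers.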

5. SHALLOW NETWORK APPROXIMATIONS OF SMOOTH FUNCTIONS

In Sec. 4 we showed that stable single hidden-layer ReLU networks are not universal approximators. In this section we give an approximation guarantee under smoothness assumptions. Namely, we show that if a function is sufficiently smooth, then it can be approximated arbitrarily well using single hidden-layer networks that correspond to stable solutions of GD. Let W^{d+1,1}_w(R^d) denote the weighted Sobolev space of all functions whose weak partial derivatives up to order d+1 are bounded in a weighted L1-norm ∥·∥_{1,w} with weight function w(x) := R*[1+|b|](x). Let ∥·∥_{W^{d+1,1}_w(R^d)} denote the corresponding Sobolev norm, ∥f∥_{W^{d+1,1}_w(R^d)} = ∥f∥_{1,w} + Σ_{k=1}^{d+1} Σ_{|β|=k} ∥∂^β f∥_{1,w}, where β is a multi-index. Proposition 3. Suppose the input dimension d is odd, and let f ∈ W^{d+1,1}_w(R^d). Then there exists a sequence {f_k} such that f_k ∈ F_k converges to f in L1 over any compact subset K ⊂ R^d, i.e., lim_{k→∞} ∫_K |f_k(x) − f(x)| dx = 0, and satisfies the bounds ∥f_k∥_R + ∥f_k∥_{R,ĝ} ≤ c_{d,ĝ} ∥f∥_{W^{d+1,1}_w(R^d)} for all k, where ĝ(v,b) = P(X⊤v > b) E[(X⊤v − b)² | X⊤v > b] (1 + E[∥X∥² | X⊤v > b]), and c_{d,ĝ} is a constant depending on d and ĝ but independent of f. Here X is drawn uniformly at random from the dataset. This proposition shows that for any f ∈ W^{d+1,1}_w(R^d) there exists a sequence of single hidden-layer ReLU network approximations {f_k} for which {∥f_k∥_R} and {∥f_k∥_{R,ĝ}} are bounded. To prove that these functions can have stable parameterizations for GD, we need to show that if both the stability norm and the R-norm are bounded (a function-space property), then there exists a corresponding minimum with bounded sharpness (i.e., top Hessian eigenvalue) in parameter space. To this end, we derive an upper bound on the minimal sharpness of a solution f over its different parameterizations, in terms of the stability norm and the R-norm (see proof in App. K). Lemma 2. Let f ∈ F_k be a solution for which the knots do not coincide with any training point.
Then there exists an implementation θ* corresponding to f such that λ_max(∇²L(θ*)) ≤ 1 + 2∥f∥_{R,ĝ} + 4(∥f∥_R + inf_{x∈R^d} ∥∇f(x)∥) λ_max(Σ_X)(1 + E[∥X∥²]). (19) Here X is drawn uniformly at random from the dataset, and Σ_X is the covariance matrix of X. Combining Prop. 3 and Lemma 2, we get that any f ∈ W^{d+1,1}_w(R^d) can be approximated arbitrarily well by a sequence of stable solutions for GD with a fixed step size η. Theorem 2. Suppose the input dimension d is odd, and let f ∈ W^{d+1,1}_w(R^d). Then there exist η > 0 and a sequence of single hidden-layer ReLU network functions {f_k} such that f_k ∈ F_k converges to f in L1 over any compact subset K ⊂ R^d, and every f_k is stable for GD with step size η. This theorem states that any sufficiently smooth function can be stably approximated in the limit of infinitely many neurons. We can also use Lemma 2 to guarantee the stability of solutions in the finite case. Since λ_max ≤ 2/η is a sufficient condition for stability in GD, we have the following. Theorem 3. Let f ∈ F_k be a solution for which the knots do not coincide with any training point. If ∥f∥_{R,ĝ} + 2(∥f∥_R + inf_{x∈R^d} ∥∇f(x)∥) λ_max(Σ_X)(1 + E[∥X∥²]) ≤ 1/η − 1/2, then f is a stable solution for GD with step size η. Theorem 3 complements Thm. 1, as it gives a sufficient condition for stability in function space.

Published as a conference paper at ICLR 2023

(Figure caption: Here we see that the bias vector grows with the step size, as the predictor function gets smoother.)
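Given estimates of the relevant norms, the sufficient condition of Theorem 3 reduces to a one-line check. The sketch below is ours and encodes the inequality as reconstructed here (the exact grouping of the λ_max(Σ_X) and E∥X∥² factors should be verified against App. K; all names are illustrative):

```python
def thm3_sufficient(norm_Rg_hat, norm_R, inf_grad, lam_max_cov, e_norm2, eta):
    """Check the sufficient stability condition of Theorem 3 (as parsed here):
        ||f||_{R,g_hat}
        + 2 * (||f||_R + inf_x ||grad f(x)||) * lam_max(Sigma_X) * (1 + E||X||^2)
        <= 1/eta - 1/2.
    By Lemma 2, this is equivalent to the flattest implementation of f
    having sharpness at most 2/eta."""
    lhs = norm_Rg_hat + 2.0 * (norm_R + inf_grad) * lam_max_cov * (1.0 + e_norm2)
    return lhs <= 1.0 / eta - 0.5

# small norms: condition holds for eta = 0.5 (threshold 1/eta - 1/2 = 1.5)
print(thm3_sufficient(0.1, 0.05, 0.0, 1.0, 2.0, eta=0.5))  # True
# large norms: this eta cannot certify stability
print(thm3_sufficient(5.0, 1.0, 0.0, 1.0, 2.0, eta=0.5))   # False
```

Note that the check is sufficient only: a solution violating it may still be stable, since the exact criterion is λ_max(∇²L(θ*)) ≤ 2/η over all parameterizations.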

6. EXPERIMENTS

We now demonstrate our theoretical results. We start with a regression task on synthetic data. Here, we drew n = 100 pairs (x_j, y_j) in R²⁰ × R from the standard normal distribution to serve as our training set. We fit a single hidden-layer ReLU network with k = 40 neurons to the data using GD with various step sizes (runs were stopped when the loss dropped below 10⁻⁸). For each minimum θ* to which GD converged, we computed the sharpness of the loss, λ_max(∇²L(θ*)), and our lower bound on the sharpness, 1 + 2∥f∥_{R,g} (Lemma 3 in the appendix). Additionally, we numerically determined the sharpness of the flattest implementation of every solution. Figure 4(a) depicts the results of this experiment. The red line marks the border of the stable region, which is 2/η. Namely, (S)GD cannot stably converge to a minimum whose sharpness is above this line. The dashed yellow line shows the sharpness of the minima to which GD converged in practice. As can be seen, here GD converged at the edge of stability (the two lines coincide), a phenomenon discussed in (Cohen et al., 2020). The blue curve is the sharpness of the flattest implementation of each solution (see App. A), while the purple curve is our lower bound. We see that our bound is quite tight (blue vs. purple). Furthermore, as the step size increases the minima get flatter in parameter space (yellow curve), which translates to smoother predictors in function space (purple curve). Additionally, we see from Fig. 4(b) that the norm of the bias vector b increases with the step size, as our theory predicts (Sec. 3.2). Next, we present an experiment with binary classification on MNIST (LeCun, 1998) using SGD. In this experiment we used n = 512 samples from two MNIST classes, '0' and '1'. The classes were labeled y = 1 and y = −1, respectively. For the validation set we used 4000 images from the remaining samples in each class.
We trained a single hidden-layer ReLU network with k = 200 neurons using SGD with batch size B = 16 and the quadratic loss. To perform classification at inference time, we thresholded the net's output at 0. We ran SGD until the loss dropped below 10⁻⁸ for 2000 consecutive epochs. Figure 5(a) shows the same quantities as in the previous experiment. Here we see again that as the step size increases, the minima get flatter in parameter space (yellow curve), which translates to smoother predictors in function space (purple curve). Figure 5(b) shows the classification accuracy on the validation set, where we see that the network generalizes better as the step size increases, consistent with past observations, e.g., (Keskar et al., 2017). More experiments appear in App. N.
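The sharpness λ_max(∇²L(θ*)) reported in these experiments can be estimated without forming the full Hessian, e.g., by power iteration on Hessian-vector products. The numpy sketch below is our code, not the paper's; it uses finite-difference Hessian-vector products and is verified on a quadratic with a known top eigenvalue:

```python
import numpy as np

def sharpness(loss_grad, theta, iters=100, eps=1e-4, seed=0):
    """Estimate lambda_max of the Hessian of the loss at theta via power
    iteration, using finite-difference Hessian-vector products:
        H v ~= (grad(theta + eps v) - grad(theta - eps v)) / (2 eps)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (loss_grad(theta + eps * v) - loss_grad(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)        # Rayleigh quotient of the current iterate
        n = np.linalg.norm(hv)
        if n == 0:
            return 0.0
        v = hv / n
    return lam

# sanity check on the quadratic loss 0.5 * theta^T H theta, lambda_max known
H = np.diag([1.0, 4.0, 2.0])
lam = sharpness(lambda th: H @ th, np.zeros(3))
print(lam)  # ~= 4.0
```

For a trained network one would pass the gradient of the empirical loss w.r.t. the flattened parameter vector as `loss_grad`, and compare the estimate against the 2/η stability threshold.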

7. RELATED WORK

Dynamical stability analysis has been applied to neural network training in several works. In particular, Nar & Sastry (2018) analyzed Lyapunov stability of one hidden-layer ReLU networks without bias. They proved a bound on the network's output which depends on the step size and training data, and implies that the network's output should be smaller for training samples with larger magnitude. Mulayoff & Michaeli (2020) characterized the flattest minima for linear nets and showed that these minima have unique properties. Yet, in their setting all minima implement the same input-output function. Thus, their results only show that SGD is biased toward certain implementations of the same function, whereas our result shows that SGD is biased toward certain functions. Wu et al. (2018) proved a sufficient condition for dynamical stability of SGD in expectation using the second moment. Ma & Ying (2021) extended their result by showing a necessary and sufficient condition for dynamical stability in higher moments. In addition, they combined this condition with the multiplicative structure of neural nets to prove an upper bound on the Sobolev seminorm of the model's input-output function at stable interpolating solutions. Their upper bound extends to deep nets, yet it depends on the norm of the first layer of the network, which in general can be large. Mulayoff et al. (2021) characterized the stable solutions of SGD for univariate single hidden-layer ReLU networks with the square loss. Our Thm. 1 is the natural extension of (Mulayoff et al., 2021, Thm. 1) to the multivariate case. To prove it, we combine the proof technique of lower bounding the top eigenvalue of the Hessian, used by Mulayoff et al. (2021), with the Radon transform analysis used by Ongie et al. (2020). Combining these techniques is not a priori trivial, since the Radon transform had not previously been used for Hessian analysis.
Also, it required several subtle steps that arise neither in the univariate setting nor in (Ongie et al., 2020) (e.g., working with the inverse of the dual Radon transform to obtain the primal-space representation). Ongie et al. (2020) studied the space of functions realizable as infinite-width single hidden-layer ReLU nets with bounded weight norm. Their setting assumes explicit regularization, i.e., a min-norm solution, whereas here we derive our results for SGD without regularization, via implicit bias. On the technical level, they introduced the "R-norm" ∥·∥_R, which is closely related to the stability norm; particularly, ∥·∥_R = ∥·∥_{R,g} for g ≡ 1, the constant function. They proved analogous depth separation and approximation results to those shown here in Secs. 4-5. More related work appears in App. B.

8. CONCLUSION

Large step sizes are often used to improve generalization (Li et al., 2019). This work suggests an explanation for this practice. Specifically, we showed that large step sizes lead to a smaller stability norm, and thus bias towards smooth predictors in shallow multivariate ReLU networks. We find that the smoothness measure depends on the data via specific weight functions g and ρ, and we exemplify their properties. Moreover, we studied the approximation power of ReLU networks that correspond to stable solutions. Although shallow networks are universal approximators, we proved that their stable solutions are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size. Yet we showed that the same function can be realized as a stable two hidden-layer network, leading to a depth separation result. This result may help explain the success of deep models over shallow ones. Finally, we gave approximation guarantees for stable shallow ReLU networks. In particular, we proved that any sufficiently smooth Sobolev function can be approximated arbitrarily well using GD with single hidden-layer ReLU networks.

A ADDITIONAL DISCUSSION

The independence of Lemma 1 and Theorem 1 of the batch size B. Theorem 1 relies on Lemma 1 (see App. E), which was proved in (Mulayoff et al., 2021). This lemma states that if a minimum θ* is linearly stable for SGD with batch size B, then the Hessian H of the loss at θ* must satisfy λ_max(H) ≤ 2/η, where η is the step size. Importantly, this necessary condition holds for any batch size B, and thus Theorem 1 is independent of B. We note, however, that the precise stability threshold of SGD might depend on B; yet the important points to notice are: (1) Here all we need is a necessary condition, and Lemma 1 provides a simple bound that holds for any B. (2) Empirical evidence shows that there is not much room for improvement upon this batch-size-independent bound in real-world settings. Specifically, for practical batch sizes, the gap between 2/η and the stability threshold of SGD is often very small (see Fig. 5 in our paper and Figures 2-3 in (Gilmer et al., 2022)). The proof of Lemma 1 is actually quite short and easy to follow (see (Mulayoff et al., 2021, App. II)). The idea is that if θ* is an ε linearly stable minimum, then by definition we have lim sup_{t→∞} E[∥θ_t − θ*∥] ≤ ε, where {θ_t}_{t=0}^∞ are governed by the linearized stochastic dynamics given in Eq. (5). Using Jensen's inequality, for all t > 0 we get ∥E[θ_t] − θ*∥ ≤ E[∥θ_t − θ*∥]. Thus, lim sup_{t→∞} ∥E[θ_t] − θ*∥ ≤ lim sup_{t→∞} E[∥θ_t − θ*∥] ≤ ε. Note that under the linearized dynamics, {E[θ_t]}_{t=0}^∞ are precisely GD steps. Therefore, if θ* is linearly stable for SGD, then it must be linearly stable also for GD. Now, a well-known fact is that θ* is linearly stable for GD if and only if λ_max(H) ≤ 2/η. This is how we get the necessary condition in Lemma 1, which does not depend on the batch size. The independence of Lemma 1 and Theorem 1 of ε.
Lemma 1 states that a necessary condition for a twice differentiable minimum to be ε linearly stable is λ_max(∇²L(θ*)) ≤ 2/η. That is, the condition does not depend on ε, which might seem unintuitive. However, the reason Lemma 1 does not depend on ε is that it refers to linear stability, as opposed to non-linear dynamical stability. In linear stability of twice-differentiable minima, all we care about is the second-order Taylor approximation of the loss at the minimum. In the previous paragraph we explained that Lemma 1 gives a necessary condition through a reduction to GD. Now, when applying GD to a quadratic loss (with a PSD Hessian), for any ε > 0 only one of two things can happen: 1. Either ∃θ_0 ∈ B_ε(θ*) : lim sup_{t→∞} ∥θ_t − θ*∥ = +∞ (unstable for any ε > 0), 2. or ∀θ_0 ∈ B_ε(θ*) : lim sup_{t→∞} ∥θ_t − θ*∥ ≤ ∥θ_0 − θ*∥ ≤ ε (stable for any ε > 0). In either outcome, the result does not depend on ε, and therefore ε does not appear in the result of Lemma 1. Note that for non-differentiable minima, ε does affect linear stability in GD; however, Lemma 1 only refers to twice-differentiable minima. Theorem 1 is based on Lemma 1, and therefore does not depend on ε. Yet, beyond this technical reasoning, it is important to note that here we consider interpolating solutions. For those solutions, the global minimum of the loss is also a global minimum w.r.t. each data sample (x_j, y_j) separately. Therefore, despite the stochasticity of SGD, every step points towards a global minimum. This implies that if the stability criterion is satisfied, then SGD converges to the minimum (lim sup_{t→∞} E[∥θ_t − θ*∥] = 0), and if it is not satisfied, then SGD repels from the minimum (lim sup_{t→∞} E[∥θ_t − θ*∥] = ∞ for the linearized dynamics). This is also seen in simulations where models are overfit to training data using SGD, e.g., (Ma et al., 2018b).
Particularly, in our simulations the loss always converged to 0 when it converged (we arbitrarily decided to stop each run when the loss dropped below 10⁻⁸). In the general case of non-interpolating solutions, the expected final distance to the minimum in mini-batch SGD (lim sup_{t→∞} E[∥θ_t - θ*∥]) can be a strictly positive finite number, and therefore in those cases ε does play a role. Large step size training and warmup. While Theorem 1 applies to any positive step size, it is most interesting when considering large step sizes. High learning rates are standard practice, as they are associated with good generalization (Li et al., 2019). However, there are cases, e.g., large initialization, in which a high learning rate might cause training to diverge. In these cases, a learning rate warmup is applied, enabling training with large step sizes. Learning rate decay. Practitioners often work with learning rate schedules, which typically reduce the step size toward the end of training. In this scenario, 2/η can be quite high at the end, making Theorem 1 loose. Here, empirical evidence shows that when reducing the step size at a late stage, the sharpness of the obtained minimum is often still controlled by the initial step size, e.g., Figs. 1 and 3 in (Gilmer et al., 2022). Moreover, although learning rate decay is a popular technique, there are other popular training schemes in which the learning rate is not reduced, e.g., (Smith et al., 2018). Lastly, the depth separation results (Sec. 4) and the approximation results (Sec. 5) apply for any fixed positive step size. In other words, learning rate decay does not affect these results. Initialization-independent results. Our results are independent of the initialization. In the past, it was shown that under certain conditions the initialization can have a large effect on the minimum to which GD converges. However, this does not contradict our results, as explained below.
For very small step sizes, the GD trajectory follows that of gradient flow (GF). Under certain conditions, e.g., infinite width or vanishing initialization, it was shown that the network does not change much along the GF trajectory. In this case, the initialization dominates the properties of the obtained solution. This is known as the kernel regime or Neural Tangent Kernel (NTK) regime (Jacot et al., 2018; Chizat et al., 2019). However, for practical step sizes and standard initialization, recent work (Cohen et al., 2020) showed that GD typically deviates from the GF trajectory, entering the Edge of Stability regime. This occurs when the stability threshold is reached during training, i.e., λmax(∇²L(θ_t)) ≥ 2/η for some t > 0. In this case, GD converges to a different minimum than GF, i.e., GD escapes the NTK regime. Similar behavior was shown also for SGD (Gilmer et al., 2022). Proof idea of Theorem 1. Theorem 1 is a result of two properties of twice-differentiable minima. First, we show in Lemma 3 in the appendix that these minima satisfy λmax(∇²θL) ≥ 1 + 2∥f∥R,g. Second, we know from Lemma 1 that stable minima satisfy λmax(∇²θL) ≤ 2/η. Together, these properties imply that 2/η ≥ 1 + 2∥f∥R,g, from which we get the result of the theorem, ∥f∥R,g ≤ 1/η - 1/2. Note that twice-differentiable minima correspond to functions whose knots do not coincide with any training point. Although we prove our result only for such functions, we observed that in practice the condition λmax(∇²θL) ≤ 2/η is always met around minima to which SGD converges (see Sec. 6 and the next paragraph). For the full derivation of the theorem see App. E. Theorem 1's assumption and minima that are not twice-differentiable. Theorem 1 assumes that the knots of f do not coincide with any training point. This assumption is made for technical simplicity, i.e., to ensure that the minimum is twice-differentiable.
In practice, we observed that usually only a small fraction of the knots coincide with training points. For example, in our MNIST experiment only 30 training points coincided with knots of f (out of n = 512 training points and k = 200 neurons). Note that the twice-differentiability assumption in Theorem 1 is required for Lemma 1 to hold, as Theorem 1 invokes Lemma 1. However, we observed that Lemma 1 still applies in practical settings where this assumption is violated. This can be appreciated by the red curve upper bounding the orange curve in Figs. 4(a) and 5(a). Namely, despite the fact that the minima are not always twice-differentiable in our experiments, the stability criterion for SGD is still observed to be upper bounded by 2/η. It is possible to extend the analysis to minima that are not twice-differentiable by using the same method as in (Mulayoff et al., 2021); however, this makes the analysis much more complex. The meaning of min λmax in Figures 4(a) and 5(a). The curve min λmax shows the sharpness of the flattest implementation of each solution f. In more detail, as discussed earlier, Theorem 1 is a result of two properties of twice-differentiable minima. On the one hand, we know from Lemma 1 that stable minima satisfy λmax(∇²θL) ≤ 2/η. On the other hand, Lemma 3 in App. E asserts that these minima satisfy λmax(∇²θL) ≥ 1 + 2∥f∥R,g. Note that each function f ∈ F_k has multiple implementations, i.e., different minima in parameter space which all correspond to f. These minima can have different sharpness. Here, we can look at the best implementation, i.e., a solution to min λmax, where the minimum is taken over all minima {θ} of the loss that implement f. Overall, given a minimum θ with a corresponding function f, we have

1 + 2∥f∥R,g ≤ min λmax(∇²θL) ≤ λmax(∇²θL) ≤ 2/η.

These inequalities give us the result of Theorem 1, ∥f∥R,g ≤ 1/η - 1/2.
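The fact that one function f admits many implementations with different sharpness follows directly from the positive homogeneity of the ReLU: rescaling a neuron as (w_i^(1), b_i^(1), w_i^(2)) → (c w_i^(1), c b_i^(1), w_i^(2)/c) leaves f unchanged but changes ΦΦ⊤/n. A minimal numpy illustration (a hypothetical toy network and dataset, using the tangent-features formula from App. F.1):

```python
import numpy as np

def tangent_features(X, W1, b1, w2):
    """Columns are grad_theta f(x_j) for f(x) = sum_i w2_i relu(x^T W1[:,i] + b1_i) + b2."""
    pre = X @ W1 + b1                      # (n, k) pre-activations
    act = (pre > 0).astype(float)          # activation pattern I(x; theta)
    cols = []
    for j in range(X.shape[0]):
        cols.append(np.concatenate([
            np.kron(w2 * act[j], X[j]),    # vec(df/dW1)
            w2 * act[j],                   # df/db1
            pre[j] * act[j],               # df/dw2
            [1.0],                         # df/db2
        ]))
    return np.array(cols).T

def sharpness(X, W1, b1, w2):
    # lambda_max of Phi Phi^T / n (the Hessian when theta is an interpolating minimum)
    Phi = tangent_features(X, W1, b1, w2)
    return np.linalg.eigvalsh(Phi @ Phi.T / X.shape[0])[-1]

def f(X, W1, b1, w2, b2):
    return np.maximum(X @ W1 + b1, 0) @ w2 + b2

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(4)
w2, b2 = rng.standard_normal(4), 0.3

c = 10.0   # rescale the first neuron; f is unchanged, lambda_max is not
W1s, b1s, w2s = W1.copy(), b1.copy(), w2.copy()
W1s[:, 0] *= c; b1s[0] *= c; w2s[0] /= c

lam_orig = sharpness(X, W1, b1, w2)
lam_rescaled = sharpness(X, W1s, b1s, w2s)
```

Both parameter vectors implement the same predictor, yet their sharpness differs, which is exactly why min λmax over implementations can sit well below λmax at the minimum SGD actually found.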
To understand the tightness of each part of our analysis, we added λmax and min λmax to the plots. In both figures, λmax(∇²θL) is equal to or just below 2/η, a phenomenon known as the edge of stability (Cohen et al., 2020). Additionally, min λmax(∇²θL) is close to 1 + 2∥f∥R,g in these experiments. Yet, Figure 4 shows that λmax(∇²θL) can be considerably larger than min λmax(∇²θL), meaning that there exists a far flatter minimum that implements the same function. This fact was used by Dinh et al. (2017) to show that sharp minima can generalize.

B ADDITIONAL RELATED WORK

Implicit bias. A long line of works has studied the implicit bias of the training procedure in an attempt to better understand generalization in overparameterized models. For the classification setting, in the case of a linear prediction function, linearly separable data, and exponentially-tailed loss functions (e.g., logistic and exponential), Soudry et al. (2018) showed that GD converges in the direction of the SVM solution. This result was later extended to linear fully-connected and convolutional neural networks (Gunasekar et al., 2018b; Ji & Telgarsky, 2019a), more loss functions (Nacson et al., 2019b; Ji & Telgarsky, 2021), the SGD optimization algorithm (Nacson et al., 2019c), other generic optimization methods (Gunasekar et al., 2018a), non-separable data (Ji & Telgarsky, 2019b), and homogeneous prediction functions (Nacson et al., 2019a; Lyu & Li, 2020; Ji et al., 2020). However, all of those results do not depend on the step size, except for the requirement that it be sufficiently small. Another line of works studied the implicit bias in the context of linear models with quadratic loss, such as matrix factorization (Gunasekar et al., 2017; Li et al., 2018; Arora et al., 2018; 2019; Belabbas, 2020; Eftekhari & Zygalakis, 2021; Gidel et al., 2019; Ma et al., 2018a; Woodworth et al., 2020; Azulay et al., 2021). However, all of these works relied on either a small or infinitesimal step size (i.e., gradient flow). Thus, they do not capture how the step size affects the implicit bias. Moreover, they assumed a manifold property (Azulay et al., 2021). As pointed out by Razin & Cohen (2020) and Vardi & Shamir (2021), these assumptions do not always apply. In contrast, our result is based on a stability condition of SGD, which depends on the step size and does not require the manifold assumption. How the step size affects the implicit bias. To investigate the implicit bias of the step size, Barrett & Dherin (2021) and Smith et al. (2021) suggested using a modified loss.
Under this modified loss, gradient flow approximates the trajectory of (S)GD on the original loss. However, the step size should be sufficiently small for the approximation to hold. Moreover, the induced regularization term this method yields is expressed in terms of the model's parameters. Additionally, this term increases linearly with the step size and vanishes at any stationary point. Radon transform analysis of shallow networks. Radon transform analysis has previously been used in studies of the approximation capabilities of single hidden-layer neural networks with bounded activation functions (Carroll & Dickinson, 1989; Ito, 1991), and with more general activation functions in the ridgelet framework (Candès & Donoho, 1999; Candès, 1999). More recently, Sonoda & Murata (2017) used ridgelet transform analysis to study the approximation properties of two-layer neural networks with unbounded activation functions, including the ReLU. Parts of this work extend results by Ongie et al. (2020), which defined a similar Radon-domain seminorm (the "R-norm") to determine the space of functions realizable as infinite-width single hidden-layer ReLU networks with square-summable weights. Parhi & Nowak (2021) proved a representer theorem for single hidden-layer ReLU networks using the R-norm. Finally, an L² version of the R-norm was used by Jin & Montúfar (2020) to describe the function-space implicit bias of training a single hidden-layer ReLU network using gradient descent in the neural tangent kernel regime.

C THE RADON TRANSFORM

For a function f : R^d → R, the d-dimensional Radon transform Rf is the collection of all integrals of f over (d-1)-dimensional affine hyperplanes in R^d. Every hyperplane can be parametrized by a pair (v, b) ∈ S^{d-1} × R, where v is a unit normal to the hyperplane and b ∈ R is its (signed) distance from the origin. Therefore, the Radon transform Rf is the function over (v, b) ∈ S^{d-1} × R given by

Rf(v, b) ≜ ∫_{v⊤x=b} f(x) ds(x),

where ds(x) represents integration with respect to the (d-1)-dimensional surface measure on the hyperplane. The dual Radon transform R* maps functions defined on S^{d-1} × R to functions on R^d by

R*φ(x) ≜ ∫_{S^{d-1}} φ(v, v⊤x) ds(v) for all x ∈ R^d,

where ds(v) represents integration with respect to the (d-1)-dimensional surface measure on the unit sphere S^{d-1}. The Radon transform and its dual are invertible over spaces of smooth functions via the inversion formulas

R^{-1} = γ_d (-∆)^{(d-1)/2} R*,   (R*)^{-1} = γ_d R (-∆)^{(d-1)/2},

where the fractional Laplacian operator (-∆)^{(d-1)/2} is defined by application of a ramp function in the Fourier domain (i.e., multiplication by ∥ω∥^{d-1} in the Fourier domain), and

γ_d = 1 / (2(2π)^{d-1}) (28)

is a dimension-dependent constant. These transforms may be extended to spaces of distributions (e.g., Dirac deltas) in a standard way (Ludwig, 1966; Helgason, 1999), which we summarize in App. D. Important for this work is the distributional dual inverse Radon transform (R*)^{-1}, which maps a distribution defined over Euclidean space R^d to a distribution in the Radon domain S^{d-1} × R.
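As a quick sanity check of the definition, the Radon transform of the isotropic Gaussian f(x) = e^{-∥x∥²} in d = 2 has a closed form: parametrizing the line {x : v⊤x = b} as x = bv + tv⊥ gives ∥x∥² = b² + t², so Rf(v, b) = e^{-b²} ∫ e^{-t²} dt = √π e^{-b²}, independent of v. A short numerical sketch (the quadrature grid sizes are arbitrary choices):

```python
import numpy as np

def radon_gaussian_2d(v, b, half_len=8.0, num=4001):
    """Numerically integrate f(x) = exp(-||x||^2) over the line {x : v^T x = b}."""
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    perp = np.array([-v[1], v[0]])                 # unit vector spanning the line
    t = np.linspace(-half_len, half_len, num)
    pts = b * v[:, None] + perp[:, None] * t       # points on the hyperplane (a line for d=2)
    vals = np.exp(-np.sum(pts**2, axis=0))
    return np.sum(vals) * (t[1] - t[0])            # simple Riemann sum; integrand decays fast

# R f(v, b) should equal sqrt(pi) * exp(-b^2) for every unit normal v
num_val = radon_gaussian_2d(v=[0.6, 0.8], b=0.5)
closed_form = np.sqrt(np.pi) * np.exp(-0.5**2)
```

Changing v leaves the value unchanged, reflecting the rotational symmetry of the Gaussian.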

D DISTRIBUTIONAL FRAMEWORK

Let f : R^d → R be any locally integrable function. Then its Laplacian ∆f can be interpreted as a tempered distribution, meaning that ∆f is defined via the duality pairing

⟨∆f, φ⟩ ≜ ⟨f, ∆φ⟩_{R^d} = ∫_{R^d} f(x) ∆φ(x) dx,

where φ is any Schwartz test function on R^d, i.e., a smooth function such that the function and its partial derivatives of all orders have sufficiently fast decay at infinity; denote this space of functions by S(R^d). For example, if f consists of a single ReLU unit, i.e., f(x) = σ(v⊤x - b) with ∥v∥ = 1, then it is easy to show that ∆f = δ(v⊤x - b), meaning

⟨∆f, φ⟩ = ∫_{x : v⊤x=b} φ(x) ds(x) = Rφ(v, b).

In other words, ∆f is the distribution given by evaluation of the Radon transform of a test function at the point (v, b) ∈ S^{d-1} × R. Next, we describe how to understand the operator (R*)^{-1} in a distributional sense. Let S_H(S^{d-1} × R) denote the image of the Schwartz functions S(R^d) under the classical Radon transform R. The space S_H(S^{d-1} × R) is characterized in (Ludwig, 1966; Helgason, 1999); it is the space of all even Schwartz functions defined on S^{d-1} × R that additionally satisfy certain moment conditions. It is also shown by Ludwig (1966) that the classical inverse Radon transform R^{-1} is a linear homeomorphism of S_H(S^{d-1} × R) onto S(R^d). Therefore, we may define its distributional transpose (R^{-1})* = (R*)^{-1} applied to any tempered distribution h by

⟨(R*)^{-1}h, ϕ⟩ = ⟨h, R^{-1}ϕ⟩ (30)

for all ϕ ∈ S_H(S^{d-1} × R). Note that (R*)^{-1}h is a distribution belonging to S′_H(S^{d-1} × R), the topological dual of S_H(S^{d-1} × R). Returning to the example where f is a single ReLU unit, i.e., f(x) = σ(v⊤x - b) with ∥v∥ = 1, for any test function ϕ ∈ S_H(S^{d-1} × R) we have

⟨(R*)^{-1}∆f, ϕ⟩ = ⟨∆f, R^{-1}ϕ⟩ = [RR^{-1}ϕ](v, b) = ϕ(v, b).

This shows (R*)^{-1}∆f = δ_(v,b), i.e., a Dirac delta centered at (v, b).
If f(x) = Σ_{i=1}^k a_i σ(v_i⊤x - b_i) + c is any single hidden-layer ReLU network such that ∥v_i∥ = 1 for all i = 1, ..., k, then by linearity we have (R*)^{-1}∆f = Σ_{i=1}^k a_i δ_(v_i,b_i). Finally, we may define the total variation ∥•∥_TV of any distribution α ∈ S′_H(S^{d-1} × R) by

∥α∥_TV ≜ sup { |⟨α, ϕ⟩| : ϕ ∈ S_H(S^{d-1} × R), ∥ϕ∥_∞ ≤ 1 }.

If ∥α∥_TV is finite, then α is a distribution of order 0. In this case, since S_H(S^{d-1} × R) is dense in the space of even continuous functions on S^{d-1} × R that vanish at infinity, α can be extended uniquely to an even signed measure on S^{d-1} × R, and ∥α∥_TV is equal to the total variation norm of α.
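The identity (R*)^{-1}∆f = Σ_i a_i δ_(v_i,b_i) can be probed numerically through the pairing ⟨∆f, ϕ⟩ = ⟨f, ∆ϕ⟩: for the Gaussian test function ϕ(x) = e^{-∥x∥²} in d = 2 (for which Rϕ(v, b) = √π e^{-b²}), integrating f∆ϕ over a grid should reproduce Σ_i a_i Rϕ(v_i, b_i). A rough sketch (the network weights below are arbitrary and the quadrature tolerance is loose):

```python
import numpy as np

# hypothetical shallow ReLU net f(x) = sum_i a_i relu(v_i^T x - b_i) + c, unit normals v_i
V = np.array([[1.0, 0.0], [0.6, 0.8], [-0.8, 0.6]])
a = np.array([1.5, -0.7, 0.4])
b = np.array([0.3, -0.2, 0.5])
c = 2.0

L, N = 8.0, 801
xs = np.linspace(-L, L, N)
dx = xs[1] - xs[0]
XX, YY = np.meshgrid(xs, xs, indexing="ij")
P = np.stack([XX, YY], axis=-1)                       # (N, N, 2) grid points

f = np.maximum(P @ V.T - b, 0.0) @ a + c              # network values on the grid
r2 = XX**2 + YY**2
lap_phi = (4.0 * r2 - 4.0) * np.exp(-r2)              # Laplacian of phi(x) = exp(-||x||^2)

# <Delta f, phi> = <f, Delta phi>; the constant c integrates to zero against Delta phi
pairing = np.sum(f * lap_phi) * dx * dx
prediction = np.sum(a * np.sqrt(np.pi) * np.exp(-b**2))   # sum_i a_i * R phi(v_i, b_i)
```

The grid sum matches the weighted sum of Radon-transform evaluations at the knots (v_i, b_i), i.e., ∆f "sees" the test function only through its Radon transform on those hyperplanes.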

E PROOF OF THEOREM 1

In the proof of the theorem we use the following lemma (for its proof see Appendix F).

Lemma 3 (Top eigenvalue lower bound). Let f ∈ F_k be a twice-differentiable minimizer of the loss function. Then λmax(∇²θL) ≥ 1 + 2∥f∥R,g, where ∥•∥R,g denotes the stability norm.

Let f ∈ F_k be a stable solution of the loss function. Then, according to Definition 3, there exists a linearly stable minimum point θ ∈ R^{(d+2)k+1} such that the network at this minimum implements f. Since the knots of f do not contain any training point, θ is a twice-differentiable minimum. From Lemma 1, since θ is a twice-differentiable stable minimum,

λmax(∇²θL) ≤ 2/η. (35)

On the other hand, from Lemma 3 we have that

λmax(∇²θL) ≥ 1 + 2∥f∥R,g. (36)

Using (35) and (36) we get

∥f∥R,g ≤ 1/η - 1/2.

F PROOF OF LEMMA 3

The proof of the lemma consists of the following steps:

1. Calculating ∇²θL, and showing that at a global minimum it takes the form ∇²θL = (1/n)ΦΦ⊤ (Appendix F.1).

2. Lower bounding λmax(∇²θL) by using

λmax(∇²θL) = max_{v∈S^{(d+2)k}} v⊤∇²θL v = max_{v∈S^{(d+2)k}} (1/n)∥Φ⊤v∥² = max_{u∈S^{n-1}} (1/n)∥Φu∥², (38)

and lower bounding the right-hand side (Appendix F.2).

3. Simplifying the lower bound to obtain a more interpretable version which does not depend on the specific implementation of f (Appendix F.3).

F.1 HESSIAN COMPUTATION

Recall that L(θ) = (1/2n) Σ_{j=1}^n (f(x_j) - y_j)², where

f(x) = Σ_{i=1}^k w_i^(2) σ(x⊤w_i^(1) + b_i^(1)) + b^(2).

We denote W^(1) = [w_1^(1), ..., w_k^(1)] ∈ R^{d×k}, b^(1) = (b_1^(1), ..., b_k^(1))⊤ ∈ R^k, w^(2) = (w_1^(2), ..., w_k^(2))⊤ ∈ R^k, b^(2) ∈ R, and

θ = [vec(W^(1)); b^(1); w^(2); b^(2)] ∈ R^{(d+2)k+1}.

Using these notations, assuming that θ* is a twice-differentiable global minimum of L, the gradient is

∇θL = (1/n) Σ_{j=1}^n (f(x_j) - y_j) ∇θf(x_j).

The Hessian is given by

∇²θL = (1/n) Σ_{j=1}^n ∇θf(x_j)∇θf(x_j)⊤ + (1/n) Σ_{j=1}^n (f(x_j) - y_j) ∇²θf(x_j) = (1/n) Σ_{j=1}^n ∇θf(x_j)∇θf(x_j)⊤,

where in the last transition we used f(x_j) = y_j for all j ∈ [n] (see Def. 2). From direct calculation we obtain

∇θf(x) = [vec(∂f/∂W^(1)); ∇_{b^(1)}f; ∇_{w^(2)}f; ∇_{b^(2)}f] = [(w^(2) ⊙ I(x; θ)) ⊗ x; w^(2) ⊙ I(x; θ); (W^(1)⊤x + b^(1)) ⊙ I(x; θ); 1],

where ⊙ denotes the Hadamard product, ⊗ represents the Kronecker product, and I : R^d × R^{(d+2)k+1} → {0,1}^k is the activation pattern of all neurons for input x, namely [I(x; θ)]_i = 1 if x⊤w_i^(1) + b_i^(1) > 0 and [I(x; θ)]_i = 0 otherwise. Let us denote the tangent-features matrix by

Φ = [∇θf(x_1) ∇θf(x_2) ⋯ ∇θf(x_n)] ∈ R^{((d+2)k+1)×n}.

Then the Hessian can be expressed as ∇²θL = ΦΦ⊤/n, and its maximal eigenvalue can be written as

λmax(∇²θL) = max_{v∈S^{(d+2)k}} v⊤∇²θLv = max_{v∈S^{(d+2)k}} (1/n)∥Φ⊤v∥² = max_{u∈S^{n-1}} (1/n)∥Φu∥².
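The closed-form expression for ∇θf above is easy to verify against numerical differentiation. The sketch below (toy sizes, numpy only, not the paper's code) builds one column of Φ per the formula and compares it to a central finite difference of f with respect to θ:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 3, 4
W1 = rng.standard_normal((d, k)); b1 = rng.standard_normal(k)
w2 = rng.standard_normal(k); b2 = 0.5
x = rng.standard_normal(d)

def unpack(theta):
    W1 = theta[:d*k].reshape(d, k, order="F")   # vec() stacks columns
    return W1, theta[d*k:d*k+k], theta[d*k+k:d*k+2*k], theta[-1]

def f(theta, x):
    W1, b1, w2, b2 = unpack(theta)
    return np.maximum(x @ W1 + b1, 0.0) @ w2 + b2

theta = np.concatenate([W1.reshape(-1, order="F"), b1, w2, [b2]])

# closed-form gradient: [ (w2 . I) x_kron ; w2 . I ; (W1^T x + b1) . I ; 1 ]
pre = x @ W1 + b1
I = (pre > 0).astype(float)
grad_formula = np.concatenate([np.kron(w2 * I, x), w2 * I, pre * I, [1.0]])

eps = 1e-6
grad_fd = np.array([(f(theta + eps*e, x) - f(theta - eps*e, x)) / (2*eps)
                    for e in np.eye(theta.size)])
```

Away from knot boundaries (the generic case for random x), the two gradients agree to finite-difference accuracy, which also confirms the Kronecker ordering used for vec(∂f/∂W^(1)).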

F.2 LOWER BOUNDING THE TOP EIGENVALUE

Continuing from the previous section's calculation, taking u = (1/√n)1 we obtain

max_{u∈S^{n-1}} (1/n)∥Φu∥² ≥ (1/n²)∥Φ1∥²
= 1 + (1/n²) Σ_{i=1}^k [ Σ_{l=1}^d (Σ_{j=1}^n w_i^(2) x_{j,l} I_{j,i})² + (Σ_{j=1}^n w_i^(2) I_{j,i})² + (Σ_{j=1}^n σ(x_j⊤w_i^(1) + b_i^(1)))² ]
= 1 + (1/n²) Σ_{i=1}^k [ (w_i^(2))² ( Σ_{l=1}^d (Σ_{j=1}^n x_{j,l} I_{j,i})² + (Σ_{j=1}^n I_{j,i})² ) + (Σ_{j=1}^n σ(x_j⊤w_i^(1) + b_i^(1)))² ]
(*) ≥ 1 + (2/n²) Σ_{i=1}^k |w_i^(2)| √( Σ_{l=1}^d (Σ_{j=1}^n x_{j,l} I_{j,i})² + (Σ_{j=1}^n I_{j,i})² ) Σ_{j=1}^n σ(x_j⊤w_i^(1) + b_i^(1)),

where in (*) we used α² + β² ≥ 2|αβ|. Let C_i ⊆ {x_j} be the set of training points for which the ith neuron is active, and denote n_i = |C_i|, that is, n_i = Σ_{j=1}^n I_{j,i}. Then,

λmax(∇²θL) ≥ 1 + (2/n²) Σ_{i=1}^k |w_i^(2)| √( ∥Σ_{x∈C_i} x∥² + n_i² ) Σ_{x∈C_i} (x⊤w_i^(1) + b_i^(1))
= 1 + 2 Σ_{i=1}^k |w_i^(2)| (n_i/n)² √( ∥(1/n_i) Σ_{x∈C_i} x∥² + 1 ) (1/n_i) Σ_{x∈C_i} (x⊤w_i^(1) + b_i^(1))
= 1 + 2 Σ_{i=1}^k |w_i^(2)| (P(X ∈ C_i))² √( ∥E[X | X ∈ C_i]∥² + 1 ) E[X⊤w_i^(1) + b_i^(1) | X ∈ C_i],

where X is a random sample from the dataset under the uniform distribution. Next, we define

ŵ_i ≜ w_i^(1)/∥w_i^(1)∥,   b̂_i ≜ -b_i^(1)/∥w_i^(1)∥.

Using these notations we obtain

λmax(∇²θL) ≥ 1 + 2 Σ_{i=1}^k |w_i^(2)| ∥w_i^(1)∥ (P(X ∈ C_i))² √( ∥E[X | X ∈ C_i]∥² + 1 ) E[X⊤ŵ_i - b̂_i | X ∈ C_i]
(*) = 1 + 2 Σ_{i=1}^k |w_i^(2)| ∥w_i^(1)∥ ḡ(ŵ_i, b̂_i)
(**) ≥ 1 + 2 Σ_{i=1}^k |w_i^(2)| ∥w_i^(1)∥ g(ŵ_i, b̂_i),

where in (*) and (**) we defined, respectively,

ḡ(ŵ, b̂) = (P(X⊤ŵ > b̂))² E[X⊤ŵ - b̂ | X⊤ŵ > b̂] √( ∥E[X | X⊤ŵ > b̂]∥² + 1 ),
g(ŵ, b̂) = min{ ḡ(ŵ, b̂), ḡ(-ŵ, -b̂) }.

From our derivation so far, we obtain that

λmax(∇²θL) ≥ 1 + 2 Σ_{i=1}^k |w_i^(2)| ∥w_i^(1)∥ g(ŵ_i, b̂_i).

We denote a_i = w_i^(2) ∥w_i^(1)∥ and the representation-dependent stability norm

S_θ ≜ Σ_{i=1}^k |a_i| g(ŵ_i, b̂_i). (55)
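Since this chain of inequalities only involves the tangent-features matrix Φ (identifying ΦΦ⊤/n with the Hessian is needed only at an interpolating minimum), it can be checked numerically at any parameter vector. A toy numpy verification of the inequality right after step (*), with the empirical (uniform-over-dataset) distribution and arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 3, 5
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, k)); b1 = rng.standard_normal(k); w2 = rng.standard_normal(k)

pre = X @ W1 + b1
act = (pre > 0).astype(float)

# tangent features Phi: columns grad_theta f(x_j), per App. F.1
cols = []
for j in range(n):
    cols.append(np.concatenate([np.kron(w2 * act[j], X[j]),
                                w2 * act[j], pre[j] * act[j], [1.0]]))
Phi = np.array(cols).T
lam_max = np.linalg.eigvalsh(Phi @ Phi.T / n)[-1]

# 1 + (2/n^2) sum_i |w2_i| sqrt(||sum_{C_i} x||^2 + n_i^2) * sum_{C_i}(x^T w1_i + b1_i)
bound = 1.0
for i in range(k):
    mask = act[:, i] > 0          # C_i: samples on which neuron i is active
    ni = mask.sum()
    if ni > 0:
        sx = X[mask].sum(axis=0)
        bound += (2.0 / n**2) * abs(w2[i]) * np.hypot(np.linalg.norm(sx), ni) * pre[mask, i].sum()
```

Taking the min with the reflected direction in step (**) only decreases the right-hand side, so the final bound with g holds a fortiori.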

F.3 IMPLEMENTATION FREE LOWER BOUND

In this section, our goal is to give a simpler lower bound on the multivariate stability norm S_θ that does not depend on the specific representation of f. Let α be the signed measure over S^{d-1} × R given by α = Σ_{i=1}^k a_i δ_(ŵ_i, b̂_i), whose total variation measure |α| is given by |α| = Σ_{i=1}^k |a_i| δ_(ŵ_i, b̂_i). Recall that

f(x) = Σ_{i=1}^k ∥w_i^(1)∥ w_i^(2) σ(x⊤ŵ_i - b̂_i) + b^(2) = Σ_{i=1}^k a_i σ(x⊤ŵ_i - b̂_i) + b^(2),

and thus

∆f(x) = Σ_{l=1}^d ∂²f(x)/∂x_l² = Σ_{i=1}^k a_i δ(x⊤ŵ_i - b̂_i) = ∫_{S^{d-1}×R} α(ŵ, b̂) δ(ŵ⊤x - b̂) ds(ŵ)db̂ = ∫_{S^{d-1}} α(ŵ, ŵ⊤x) ds(ŵ) = R*α.

Namely, ∆f is a weighted sum of Diracs supported on hyperplanes. From the last equation we obtain α = (R*)^{-1}∆f. Combining these results we get that

S_θ = ∫_{S^{d-1}×R} g d|α| = ⟨|α|, g⟩ ≥ ∫_{S^{d-1}×R} |(R*)^{-1}∆f|(v, b) g(v, b) ds(v)db = ∥f∥R,g,

where the inequality in the third step is due to g being non-negative and to scenarios in which multiple deltas become active at the same location, namely ∃i ≠ j : ŵ_i = ŵ_j and b̂_i = b̂_j (so that the magnitude of the summed weights is at most the sum of their magnitudes). Note that if the deltas do not align, then S_θ = ∥f∥R,g. Overall, we have that λmax(∇²θL) ≥ 1 + 2S_θ ≥ 1 + 2∥f∥R,g.

G DERIVATION OF THE STABILITY NORM IN PRIMAL SPACE

We defined the multivariate stability norm as ∥f∥R,g = ⟨|(R*)^{-1}∆f|, g⟩_{S^{d-1}×R}, where ⟨•,•⟩_{S^{d-1}×R} denotes the integral inner product on S^{d-1} × R. Supposing that the inverse Radon transform of g exists, then by purely formal reasoning we ought to have

∥f∥R,g = ⟨|(R*)^{-1}∆f|, g⟩_{S^{d-1}×R} = ⟨|(R*)^{-1}∆f|, RR^{-1}g⟩_{S^{d-1}×R} = ⟨R*|(R*)^{-1}∆f|, R^{-1}g⟩_{R^d}.

Therefore, making the (formal) definitions |∆f|_R ≜ R*|(R*)^{-1}∆f| and ρ ≜ R^{-1}g, we may also interpret the stability norm ∥f∥R,g as the quantity

∫_{R^d} |∆f|_R(x) ρ(x) dx. (61)

In the event that f and g are smooth, and g is in the range of the classical Radon transform, the above expression is equal to ∥f∥R,g. However, this is not generally the case in our setting, and below we show how to give a more precise interpretation of this integral formula using distributional theory. In particular, we show that when f is a finite-width ReLU network, |∆f|_R is equal to the total variation measure of ∆f (i.e., the measure-theoretic analog of the absolute value of a function). Additionally, in the event that g does not have a classically defined Radon inverse, we show how the integral in (61) can be interpreted using a smoothing approach. Let f be a finite-width single hidden-layer ReLU network, i.e., f ∈ F_k for some finite k. Recall that (R*)^{-1}∆f is a finite weighted sum of Diracs in S^{d-1} × R. Let |(R*)^{-1}∆f| be the associated total variation measure, and define |∆f|_R = R*|(R*)^{-1}∆f|, where R* is the distributional dual Radon transform. Here |∆f|_R is a tempered distribution given by a (positive) weighted sum of Diracs supported on hyperplanes. For example, if f is a single ReLU unit, i.e., f(x) = aσ(x⊤v - b) with ∥v∥ = 1, then |∆f|_R is the distribution |a|δ(x⊤v - b); that is, for any test function ϕ we have

⟨|∆f|_R, ϕ⟩_{R^d} = |a| ∫_{x⊤v=b} ϕ(x) ds(x) = |a|Rϕ(v, b).

More generally, |∆f|_R agrees with the total variation measure |∆f| when tested against continuous functions supported on a compact set, as we now show.

Proof. Since f ∈ F_k, there exists a representation of f as f(x) = Σ_{i=1}^{k′} a_i σ(v_i⊤x - b_i) + x⊤q + c, where ∥v_i∥ = 1 for all i, a_i ≠ 0 for all i, and (v_i, b_i) ≠ ±(v_j, b_j) for all i ≠ j (i.e., the knots of all ReLU units are distinct), and where k′ ≤ k. Therefore, each ReLU unit in this representation of f maps to a distinct Dirac δ_(v_i,b_i) in the Radon domain after applying the operator (R*)^{-1}∆, and so

|(R*)^{-1}∆f| = Σ_{i=1}^{k′} |a_i| δ_(v_i,b_i).

Let X be any compact subset of R^d, and let ϕ be any continuous test function defined over X. Then,

⟨|∆f|_R, ϕ⟩ = ⟨|(R*)^{-1}∆f|, Rϕ⟩ = Σ_{i=1}^{k′} |a_i| Rϕ(v_i, b_i).

Now, we show the same equality holds with |∆f| in place of |∆f|_R. Let I₊ denote the set of indices i such that a_i > 0 and I₋ the set of indices i such that a_i < 0. Define the measures μ₊ = Σ_{i∈I₊} a_i δ(v_i⊤• - b_i) and μ₋ = -Σ_{i∈I₋} a_i δ(v_i⊤• - b_i). Observe that μ₊ and μ₋ are both positive measures whose supports only possibly intersect on a set of measure zero, and ∆f = μ₊ - μ₋. This implies the total variation measure of ∆f is given by |∆f| = μ₊ + μ₋ = Σ_{i=1}^{k′} |a_i| δ(v_i⊤• - b_i). Hence, ⟨|∆f|, ϕ⟩ = Σ_{i=1}^{k′} |a_i| Rϕ(v_i, b_i), as claimed. Note, however, that when f corresponds to a finite-width ReLU network, |∆f| does not have finite total variation when considered as a measure defined over all of R^d. Due to this technicality, when ρ is not compactly supported, we need to understand the integral in (61) as being with respect to the distribution |∆f|_R in place of the measure |∆f|. Now we show that when ρ = R^{-1}g does not exist in a classical sense, the integral in (61) can still be interpreted using a smoothing approach.

Proposition 5. Let f ∈ F_k, suppose g is an even, piecewise continuous L¹ function on S^{d-1} × R, and let ρ = R^{-1}g be its distributional Radon inverse. Further, assume the support of |(R*)^{-1}∆f| does not intersect the set of points where g is discontinuous.
Then ∥f∥R,g is finite and

∥f∥R,g = ∫_{R^d} |∆f(x)| ρ(x) dx,

where the integral above is understood as the finite limit

lim_{ϵ→0} ⟨|∆f|_R, ρ_ϵ⟩_{R^d},

where ρ_ϵ is a smooth approximation of ρ, defined independently of f, whose classical Radon transform g_ϵ = Rρ_ϵ exists for all ϵ > 0, and g_ϵ → g uniformly as ϵ → 0 on any closed subset of S^{d-1} × R over which g is continuous.

Proof. For any ϵ > 0, let ϕ_ϵ ∈ S(S^{d-1} × R) be a compactly supported even function acting as a smooth approximation of the identity, i.e., for any continuous, even function h vanishing at infinity we have ϕ_ϵ * h → h uniformly as ϵ → 0, where * denotes convolution of functions on S^{d-1} × R. Define g_ϵ = (ϕ_ϵ * g) · χ_ϵ, where χ_ϵ(v, b) is a smooth cutoff function that is equal to one if |b| ≤ 1/ϵ and rapidly decays to zero for |b| ≥ 1/ϵ. Observe that g_ϵ is an even Schwartz function by construction. Furthermore, since g is piecewise continuous and L¹ (and, in particular, vanishes at infinity), g_ϵ → g uniformly over any closed set that does not intersect the set of points where g is discontinuous. Therefore, if we let U ⊂ S^{d-1} × R be any closed set containing the support of |(R*)^{-1}∆f| that does not intersect the set of points where g is discontinuous (which is guaranteed to exist since the support of |(R*)^{-1}∆f| is a finite set), then g_ϵ → g uniformly over U. Therefore,

∥f∥R,g = ⟨|(R*)^{-1}∆f|, g⟩_{S^{d-1}×R} = lim_{ϵ→0} ⟨|(R*)^{-1}∆f|, g_ϵ⟩_{S^{d-1}×R},

where the limit is guaranteed to exist since the finite measure |(R*)^{-1}∆f| is a continuous linear functional over C₀(U), the space of continuous functions over U vanishing at infinity. Finally, since g_ϵ is Schwartz, (Solmon, 1987, Thm. 7.7) guarantees that ρ_ϵ = R^{-1}g_ϵ exists as a C^∞-smooth function on R^d that is integrable along hyperplanes and for which the classical Radon inversion formula holds: Rρ_ϵ = g_ϵ.
Therefore, we have

∥f∥R,g = lim_{ϵ→0} ⟨|(R*)^{-1}∆f|, Rρ_ϵ⟩_{S^{d-1}×R} = lim_{ϵ→0} ⟨R*|(R*)^{-1}∆f|, ρ_ϵ⟩_{R^d} = lim_{ϵ→0} ⟨|∆f|_R, ρ_ϵ⟩_{R^d},

as claimed. The assumption made above, that the support of |(R*)^{-1}∆f| does not intersect the set of points where g is discontinuous, is not overly restrictive. For example, this assumption holds when f corresponds to a differentiable minimizer of the squared loss defined in terms of a finite set of training points and g is the data-dependent weighting function defined in (12). In this case, the discontinuity set of g(v, b) corresponds to the set of hyperplanes {x ∈ R^d : x⊤v = b} that intersect one or more of the training points. And f corresponds to a differentiable minimizer if and only if the hyperplanes defined by the knots of the ReLU units making up f (i.e., the support of |(R*)^{-1}∆f|) do not intersect any training points.

H EXAMPLES OF g AND ρ

H.1 TWO DATAPOINTS

For this example, it is easy to calculate that g(v, b) = α σ(|v₁| - |b|), where α is a positive constant. We compute ρ = R^{-1}g by first determining its Laplacian ∆ρ, and then inverting the Laplacian to recover ρ. First, the intertwining property of the Laplacian and the Radon transform gives R∆ρ = (∂/∂b)² Rρ = (∂/∂b)² g. Therefore, by the Fourier slice theorem (Helgason, 1999), for all (v, s) ∈ S^{d-1} × R we have F{∆ρ}(sv) = F_b{(∂/∂b)² g}(v, s), where F{•} is the 2-D Fourier transform and F_b{•} is the Fourier transform in the b variable. For fixed v, the function b ↦ g(v, b) is continuous and piecewise linear with knots at 0 and ±|v₁|, and it is easy to see that

(∂/∂b)² g(v, b) = α( δ(b - |v₁|) + δ(b + |v₁|) - 2δ(b) ),

which implies

F{∆ρ}(sv) = F_b{(∂/∂b)² g}(v, s) = α( e^{-j2π|v₁|s} + e^{j2π|v₁|s} - 2 ).

If we restrict the unit-norm vector v = (v₁, v₂) to be such that v₁ ≥ 0 and define ξ = sv, then x₁⊤ξ = s|v₁| and x₂⊤ξ = -s|v₁|. Therefore, we have F{∆ρ}(ξ) = α( e^{-j2πx₁⊤ξ} + e^{-j2πx₂⊤ξ} - 2 ), and inverting the Fourier transform gives

∆ρ(x) = α( δ(x - x₁) + δ(x - x₂) - 2δ(x) ).

Finally, since φ(x) = (1/2π) log(∥x∥) is the fundamental solution of Poisson's equation in 2D (i.e., ∆φ = δ, where δ is a Dirac centered at the origin), we have

ρ(x) = (α/2π)( log(∥x - x₁∥) + log(∥x - x₂∥) - 2 log(∥x∥) ).

While each term in the sum above is not absolutely integrable along lines, their sum is. This is because the function t ↦ log(|t|) is absolutely integrable over any neighborhood of the origin, and by a multipole expansion we may show that ρ(x) = O(∥x∥⁻²) as ∥x∥ → ∞, which is absolutely integrable along lines.
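Away from the singular points {0, x₁, x₂}, this ρ is harmonic, which gives a quick numerical check. Taking for concreteness the symmetric pair x₁ = (1, 0), x₂ = (-1, 0) (consistent with the Fourier expressions above, though the actual datapoints are an assumption here) and α = 1, a 5-point discrete Laplacian of ρ should vanish at any regular point:

```python
import numpy as np

x1, x2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

def rho(p):
    # rho(x) = (alpha/2pi) [ log||x - x1|| + log||x - x2|| - 2 log||x|| ], with alpha = 1
    return (np.log(np.linalg.norm(p - x1)) + np.log(np.linalg.norm(p - x2))
            - 2.0 * np.log(np.linalg.norm(p))) / (2.0 * np.pi)

def discrete_laplacian(fn, p, h=1e-3):
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    return (fn(p + e1) + fn(p - e1) + fn(p + e2) + fn(p - e2) - 4.0 * fn(p)) / h**2

p = np.array([0.7, 1.3])        # a point away from 0, x1, and x2
lap = discrete_laplacian(rho, p)
```

The same routine also illustrates the decay: the three log terms nearly cancel at large ∥x∥, consistent with the O(∥x∥⁻²) multipole estimate.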

H.2 ISOTROPIC DATA DISTRIBUTION

For an isotropic data distribution, i.e., P(X⊤v > b) = M(b) for any v satisfying ∥v∥ = 1, we have that

ḡ(v, b) = M(b) ∫_b^∞ M(z)dz √( ( b + (1/M(b)) ∫_b^∞ M(z)dz )² + 1 ).

Then, from symmetry, and assuming that ḡ is decreasing in b, we obtain

g(v, b) = M(|b|) ∫_{|b|}^∞ M(z)dz √( ( |b| + (1/M(|b|)) ∫_{|b|}^∞ M(z)dz )² + 1 ).

Note that g only depends on |b| and is decreasing in |b|. Considering the parameter-space representation of the stability norm, S_θ = Σ_{i=1}^k |a_i| g(v_i, b_i), we can see that solutions with larger |b_i|, i.e., solutions which are flatter in function space, will have a smaller stability norm. We now characterize ρ = R^{-1}g. For simplicity, we focus on the two-dimensional setting (d = 2). Since g(v, b) does not depend on v, we drop this dependence and simply write g(b). Note that this implies ρ is a radial function. Let ρ(r) denote the radial profile of ρ, i.e., ρ(x) = ρ(∥x∥). We additionally make the following assumptions: g(b) is twice continuously differentiable away from the origin, and both g and its weak derivative g′ are bounded and absolutely integrable. In this case, ρ has the integral formula

ρ(r) = -(1/π) ∫_r^∞ g′(b)/√(b² - r²) db. (81)

The assumptions on g above are sufficient to show that the integrand in (81) is absolutely integrable over [r, ∞) for r > 0. Since g(|b|) is assumed to be decreasing in |b|, we have -g′(b) ≥ 0 for all b > 0, which shows ρ(r) ≥ 0 for all r > 0. Moreover, if g is not smooth at the origin, then g′(b) = O(1) as b → 0⁺, and elementary analysis shows ρ(r) = O(log(r)) as r → 0⁺ and ρ(r) = O(1/r) as r → +∞. Finally, if we additionally assume g′(b) is non-increasing for b > 0, then ρ(r) is strictly decreasing for r > 0. To see this, fix any r′ > r, and define δ = r′ - r. Using the change of variables b → b - δ, we may show

ρ(r′) = -(1/π) ∫_r^∞ g′(b + δ)/√( (b + δ)² - r′² ) db. (82)
Since g′ is assumed to be non-increasing, we have g′(b + δ) ≤ g′(b), and it is elementary to show that ((b + δ)² - r′²)^{-1/2} < (b² - r²)^{-1/2} for all b > r. This shows that the integrand in (82) is pointwise strictly bounded above by the integrand in (81) for all b > r; hence ρ(r′) < ρ(r). In the case of 2D Gaussian distributed data X ∼ N(0, I), M(b) is the complementary CDF of a standard normal random variable:

M(b) = (1/√(2π)) ∫_b^∞ e^{-t²/2} dt.

It is easy to verify that the resulting g satisfies the above assumptions (g(b) is decreasing, twice continuously differentiable away from the origin, both g and its weak derivative g′ are bounded and absolutely integrable, and g′ is non-increasing). Therefore, the resulting ρ has all the properties outlined above.
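For the Gaussian case, both g and ρ can be evaluated with a few lines of numerics: M(b) is the Gaussian tail, ∫_b^∞ M(z)dz has the closed form φ(b) - bM(b) (with φ the standard normal pdf), and the substitution b = √(r² + u²) removes the integrable singularity in (81). A sketch (the grid sizes and finite-difference step below are arbitrary choices):

```python
import numpy as np
from math import erfc, pi, sqrt, exp

def M(b):                      # Gaussian tail P(T > b), T ~ N(0, 1)
    return 0.5 * erfc(b / sqrt(2.0))

def tail_int(b):               # closed form of int_b^inf M(z) dz = pdf(b) - b*M(b)
    return exp(-b * b / 2.0) / sqrt(2.0 * pi) - b * M(b)

def g(b):
    b = abs(b)
    I = tail_int(b)
    return M(b) * I * sqrt((b + I / M(b))**2 + 1.0)

def rho(r, u_max=20.0, num=20000, h=1e-4):
    """rho(r) = -(1/pi) int_r^inf g'(b)/sqrt(b^2 - r^2) db, via b = sqrt(r^2 + u^2)."""
    u = (np.arange(num) + 0.5) * (u_max / num)          # midpoint grid in u
    b = np.sqrt(r * r + u * u)
    gprime = np.array([(g(bi + h) - g(bi - h)) / (2 * h) for bi in b])
    return -np.sum(gprime / b) * (u_max / num) / pi
```

As predicted by the analysis above, g is decreasing in |b| and the resulting radial profile ρ(r) is positive and decreasing.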

I DEPTH SEPARATION PROOFS

Before giving the proofs in this section we introduce some additional notation. Let X̄ denote the closed convex hull of the training points and X its open interior. Additionally, let Y = {(v, b) ∈ S^{d-1} × R : v⊤x > b for some x ∈ X}, and let Ȳ denote its closure. Note that for any smooth function ϕ with support contained in X, the Radon transform Rϕ has support contained in Y. Finally, for any distribution h and open set U, we let h|_U denote its restriction to U.

I.1 PROOF OF PROPOSITION 1

First, we show that convergence of a sequence of functions f_k to f in L¹-norm over X implies that the sequence of distributions ∆f_k|_X converges weakly to the distribution ∆f|_X. For all test functions ϕ ∈ S(X) we have

|⟨∆f_k - ∆f, ϕ⟩| = |⟨f_k - f, ∆ϕ⟩| ≤ ∥f_k - f∥_{L¹(X)} ∥∆ϕ∥_{L∞(X)},

where we used Hölder's inequality to achieve the final bound. Therefore, ⟨∆f_k, ϕ⟩ → ⟨∆f, ϕ⟩ as k → ∞, which proves the weak convergence.

Next, we show that $(\mathcal{R}^*)^{-1}\Delta f_k|_Y$ converges weakly to $(\mathcal{R}^*)^{-1}\Delta f|_Y$. For all test functions $\varphi \in \mathcal{S}_H(Y)$ we have
$$\big\langle (\mathcal{R}^*)^{-1}\Delta f_k - (\mathcal{R}^*)^{-1}\Delta f, \varphi\big\rangle = \big\langle \Delta f_k - \Delta f, \mathcal{R}^{-1}\varphi\big\rangle. \qquad (84)$$
Since $\varphi$ vanishes outside $Y$, by the support theorem (Helgason, 1999, Corollary 2.8) we are ensured that $\mathcal{R}^{-1}\varphi$ has support contained in $X$, hence $\mathcal{R}^{-1}\varphi \in \mathcal{S}(X)$. The desired result now follows immediately from the weak convergence of $\Delta f_k|_X$ to $\Delta f|_X$. Since $g$ has support contained in $Y$, this further implies that the distribution $g \cdot (\mathcal{R}^*)^{-1}\Delta f_k$ converges weakly to the distribution $g \cdot (\mathcal{R}^*)^{-1}\Delta f$. Finally, since $\|f_k\|_{\mathcal{R},g} = \|g \cdot (\mathcal{R}^*)^{-1}\Delta f_k\|_{\mathrm{TV}}$ is bounded by assumption, each $g \cdot (\mathcal{R}^*)^{-1}\Delta f_k$ is a measure having finite total variation, hence their weak limit $g \cdot (\mathcal{R}^*)^{-1}\Delta f$ is also a measure with finite total variation (i.e., the weak limit of order-0 distributions that are bounded in TV-norm is also an order-0 distribution). Therefore, $\|f\|_{\mathcal{R},g} = \|g \cdot (\mathcal{R}^*)^{-1}\Delta f\|_{\mathrm{TV}}$ is finite, as claimed.

I.2 PROOF OF PROPOSITION 2

To show that the pyramid function $p$ has infinite stability norm, we prove that $g \cdot (\mathcal{R}^*)^{-1}\Delta p$ is a distribution of order $> 0$, which implies $\|p\|_{\mathcal{R},g} = \sup_{\phi \in \mathcal{S}_H(\mathbb{S}^{d-1}\times\mathbb{R})} \langle g \cdot (\mathcal{R}^*)^{-1}\Delta p, \phi\rangle = +\infty$. First, observe that the Laplacian $\Delta p$ is an order-0 distribution whose support is contained in the unit $\ell_1$-ball. This implies that the distribution $(\mathcal{R}^*)^{-1}\Delta p$ is supported on a compact set $K$ in the Radon domain. By our assumption on the convex hull of the training points, $g(v, b) > 0$ for all $(v, b) \in K$. Since $g$ is piecewise continuous and $K$ is compact, there exist constants $c_1, c_2 > 0$ such that $c_1 \le g(v, b) \le c_2$ for all $(v, b) \in K$. Therefore, we see that $g \cdot (\mathcal{R}^*)^{-1}\Delta p$ is an order-0 distribution if and only if $(\mathcal{R}^*)^{-1}\Delta p$ is an order-0 distribution. However, by a result in (Ongie et al., 2020), we know $(\mathcal{R}^*)^{-1}\Delta p$ has order $> 0$ (i.e., in the terminology of (Ongie et al., 2020), $p$ has infinite $\mathcal{R}$-norm). Additionally, we show this by direct calculation in App. M in the case of input dimension $d = 2$. Therefore, $g \cdot (\mathcal{R}^*)^{-1}\Delta p$ must be a distribution of order $> 0$, and hence $\|p\|_{\mathcal{R},g} = +\infty$ as claimed.
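While Proposition 2 rules out stable shallow approximations of $p$, the pyramid is exactly realizable with two hidden layers via the identity $|t| = \sigma(t) + \sigma(-t)$, so that $p(x) = \sigma\big(1 - \sigma(x_1) - \sigma(-x_1) - \sigma(x_2) - \sigma(-x_2)\big)$. A minimal numerical sketch of this identity (assuming NumPy; the construction follows directly from the definition of $p$):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def pyramid(x1, x2):
    """p(x) = [1 - |x1| - |x2|]_+ evaluated directly."""
    return relu(1.0 - np.abs(x1) - np.abs(x2))

def two_layer_pyramid(x1, x2):
    # layer 1: |t| = relu(t) + relu(-t), applied to each coordinate
    h = relu(x1) + relu(-x1) + relu(x2) + relu(-x2)
    # layer 2: a single ReLU applied to an affine function of the first layer
    return relu(1.0 - h)

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-2, 2, size=(2, 10_000))
assert np.allclose(pyramid(x1, x2), two_layer_pyramid(x1, x2))
```
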

I.3 STABILITY OF THE TWO HIDDEN-LAYER IMPLEMENTATION OF p(x)

Let us focus on the under-parameterized setting, in which there exists a single optimal input-output predictor $p(x)$ that globally minimizes the loss. In this case, the set of all global minima corresponds to different implementations of $p(x)$. Under this setting, we prove that there exists a set of nonzero Lebesgue measure such that for any initialization inside this set, GD necessarily converges to $p(x)$. To do so, we first prove that for any minimum point $\theta^* \in \mathbb{R}^m$ corresponding to an implementation of $p(x)$, there exists a nonzero step size $\eta$ with which $\theta^*$ is linearly stable. Furthermore, we show that there exists a set $T^s_{\mathrm{loc}}(\theta^*)$, embedded in a subspace of dimension $m - m_{\mathrm{Null}}$, such that any initialization in it converges to $\theta^*$. Here $m$ is the number of parameters in our two hidden-layer network, and $m_{\mathrm{Null}}$ is the number of zero eigenvalues of $\nabla^2 L$ at $\theta^*$. Next, we show that there is a connected set of minima $\Theta^*$ around $\theta^*$, such that the union $\bigcup_{\theta\in\Theta^*} T^s_{\mathrm{loc}}(\theta)$ has nonzero Lebesgue measure.

Let us start with some minimum point $\theta^*$ corresponding to an implementation of $p(x)$. GD's update rule is
$$\theta_{t+1} = \theta_t - \eta\nabla L(\theta_t). \qquad (86)$$
Define the mapping $T(\theta) = \theta - \eta\nabla L(\theta)$. Then (86) can be written as
$$\theta_{t+1} = T(\theta_t). \qquad (87)$$
This equation describes the full dynamics of GD through the nonlinear mapping $T$. Note that in this representation, $\theta^*$ is an equilibrium point of $T$, i.e., $T(\theta^*) = \theta^*$. We would like to show that it is possible to converge to $\theta^*$. Assume there is a finite number of training samples $n$, none of which coincide with the knots of $p$. Then $T$ is differentiable in a small neighborhood of $\theta^*$. The Jacobian matrix of $T$ is $\frac{\partial}{\partial\theta}T = I - \eta\nabla^2 L(\theta^*)$, and its eigenvalues are
$$\lambda_i\!\left(\tfrac{\partial}{\partial\theta}T\right) = \lambda_i\!\left(I - \eta\nabla^2 L(\theta^*)\right) = 1 - \eta\lambda_i\!\left(\nabla^2 L(\theta^*)\right).$$
In this setting, the Hessian of the loss at $\theta^*$ has non-negative and bounded eigenvalues. In particular, there exists a sufficiently small step size $\eta$ that satisfies
$$0 < 1 - \eta\lambda_i\!\left(\nabla^2 L(\theta^*)\right) \le 1 \quad \text{for all } i.$$
Using the Center and Stable Manifold Theorem (Shub, 2013, Thm. III.7), there exists a bounded set $T^s_{\mathrm{loc}}(\theta^*)$ such that $T(T^s_{\mathrm{loc}}(\theta^*)) \subseteq T^s_{\mathrm{loc}}(\theta^*)$ and
$$\forall \theta \in T^s_{\mathrm{loc}}(\theta^*): \quad \|T(\theta) - \theta^*\| \le \alpha\,\|\theta - \theta^*\|,$$
for some $0 \le \alpha < 1$. Here $T^s_{\mathrm{loc}}(\theta^*)$ is tangent to the affine subspace that contains $\theta^*$ and is spanned by the eigenvectors of $\nabla^2 L(\theta^*)$ corresponding to its nonzero eigenvalues. Thus, if the initial point satisfies $\theta_0 \in T^s_{\mathrm{loc}}(\theta^*)$, then
$$\|\theta_t - \theta^*\| \le \alpha^t\,\|\theta_0 - \theta^*\| \;\longrightarrow\; 0 \quad \text{as } t \to \infty,$$
which shows that for any initialization in $T^s_{\mathrm{loc}}(\theta^*)$, GD's iterates converge to $\theta^*$. Next, note that there is a neighborhood of $\theta^*$ within which the set of global minima of the loss forms a smooth $m_{\mathrm{Null}}$-dimensional manifold. Let us denote this set of global minima around $\theta^*$ by $\Theta^*$, set $\eta < 2/\max_{\theta\in\Theta^*}\{\lambda_{\max}(\nabla^2 L)\}$, and restrict $\Theta^*$ so that $\eta \ne 1/\lambda_i$ for all $i$. Then, for each minimum $\theta \in \Theta^*$, according to the first part of the proof, there exists an $(m - m_{\mathrm{Null}})$-dimensional set $T^s_{\mathrm{loc}}(\theta)$. Now, since each $T^s_{\mathrm{loc}}(\theta)$ is contained in a hyperplane orthogonal to the tangent of $\Theta^*$ at $\theta$, and the dimension of the tangent of $\Theta^*$ at $\theta^*$ is $m_{\mathrm{Null}}$, the dimension of the union $\bigcup_{\theta\in\Theta^*} T^s_{\mathrm{loc}}(\theta)$ is $m$. Thus, the set $\bigcup_{\theta\in\Theta^*} T^s_{\mathrm{loc}}(\theta)$ has nonzero Lebesgue measure in $\mathbb{R}^m$, and for any initialization in $\bigcup_{\theta\in\Theta^*} T^s_{\mathrm{loc}}(\theta)$, GD converges to $p(x)$.
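The linearized-stability mechanism above can be illustrated on a toy quadratic loss (a sketch under simplifying assumptions, not the network objective from the proof): for $L(\theta) = \frac{1}{2}\theta^\top H\theta$ with a singular PSD Hessian, the GD map $T(\theta) = \theta - \eta H\theta$ has eigenvalues $1 - \eta\lambda_i$, so iterates converge to a (flat) minimizer whenever $\eta < 2/\lambda_{\max}$ and diverge beyond that threshold:

```python
import numpy as np

# Toy quadratic with two zero Hessian eigenvalues (a "minimum manifold").
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
eigs = np.array([4.0, 2.0, 1.0, 0.0, 0.0])
H = Q @ np.diag(eigs) @ Q.T

def run_gd(eta, steps=2000):
    theta = rng.standard_normal(5)
    for _ in range(steps):
        theta = theta - eta * (H @ theta)      # T(theta) = theta - eta * grad L
    return theta

stable = run_gd(eta=0.4)                       # 0.4 < 2/lambda_max = 2/4
assert np.linalg.norm(H @ stable) < 1e-8       # gradient vanished: reached a minimum

unstable = run_gd(eta=0.6)                     # 0.6 > 2/4: |1 - eta*4| > 1, top mode blows up
n = np.linalg.norm(unstable)
assert (not np.isfinite(n)) or n > 1e6
```

Note that with $\eta = 0.4$ the flat (zero-eigenvalue) directions are left untouched, so GD converges to *some* point on the minimum manifold rather than to a unique minimizer, mirroring the role of $\Theta^*$ above.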

J GENERAL LOSS FUNCTIONS

In this section, we discuss how our results can be extended to general loss functions with a unique finite root. Assume some general loss function
$$L(f) = \frac{1}{n}\sum_{j=1}^n \ell\big(f(x_j), y_j\big),$$
where $\ell(a, b)$ is twice differentiable w.r.t. $a$ and is minimized when $a = b$, i.e., $\ell'(a, b) \triangleq \frac{\partial}{\partial a}\ell(a, b) = 0$ for $a = b$. Then the gradient of the loss is
$$\nabla_\theta L = \frac{1}{n}\sum_{j=1}^n \ell'\big(f(x_j), y_j\big)\,\nabla_\theta f(x_j),$$
and its Hessian matrix is
$$\nabla^2_\theta L = \frac{1}{n}\sum_{j=1}^n \ell''\big(f(x_j), y_j\big)\,\nabla_\theta f(x_j)\nabla_\theta f(x_j)^\top + \frac{1}{n}\sum_{j=1}^n \ell'\big(f(x_j), y_j\big)\,\nabla^2_\theta f(x_j) = \frac{1}{n}\sum_{j=1}^n \ell''\big(f(x_j), y_j\big)\,\nabla_\theta f(x_j)\nabla_\theta f(x_j)^\top,$$
where $\ell''(a, b) \triangleq \frac{\partial^2}{\partial a^2}\ell(a, b)$, and in the last transition we used $f(x_j) = y_j$ and $\ell'(a, a) = 0$ for all $a \in \mathbb{R}$. If $\ell''(f(x_j), y_j) = C > 0$ for all training points, then we can generalize our results by simply multiplying the RHS of (47) by $C$. If not, the analysis can still be used, but we need to add a weighting term to the stability norm that depends on the value of $\ell''(f(x_j), y_j)$ at each data point.
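A quick numerical check of the Hessian simplification above, using an illustrative nonlinear model $f(x;\theta) = \theta_0 e^{\theta_1 x}$ and the non-quadratic loss $\ell(a, y) = \cosh(a - y) - 1$ (both hypothetical choices, not from the paper): at an interpolating minimum, $\ell' = 0$, so a finite-difference Hessian of $L$ should match the first ("Gauss-Newton") term $\frac{1}{n}\sum_j \ell''\,\nabla_\theta f\,\nabla_\theta f^\top$, with $\ell''(y, y) = 1$ here.

```python
import numpy as np

X = np.array([0.0, 1.0])
Y = np.array([1.0, 2.0])
theta_star = np.array([1.0, np.log(2.0)])   # interpolates: f(0)=1, f(1)=2

def f(x, th):  return th[0] * np.exp(th[1] * x)
def loss(th):  return float(np.mean(np.cosh(f(X, th) - Y) - 1.0))

def fd_hessian(fun, th, h=1e-4):
    """Central finite-difference Hessian."""
    d = len(th)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.eye(d)[i] * h
            ej = np.eye(d)[j] * h
            H[i, j] = (fun(th + ei + ej) - fun(th + ei - ej)
                       - fun(th - ei + ej) + fun(th - ei - ej)) / (4 * h * h)
    return H

# Gauss-Newton term: (1/n) sum_j ell''(y_j, y_j) grad f(x_j) grad f(x_j)^T, ell'' = 1 here
grads = np.stack([np.array([np.exp(theta_star[1] * x),
                            theta_star[0] * x * np.exp(theta_star[1] * x)]) for x in X])
GN = grads.T @ grads / len(X)

assert np.allclose(fd_hessian(loss, theta_star), GN, atol=1e-4)
```
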

K PROOF OF LEMMA 2

In this section, our goal is to upper bound the top eigenvalue of the Hessian at the flattest implementation of a predictor function $f$. Here we use notation and some derivations from App. F.1. Let $q$ denote the top right singular vector of $\Phi$. Then
$$\lambda_{\max}\!\left(\nabla^2_\theta L\right) = \frac{1}{n}\|\Phi q\|^2 = \frac{1}{n}\Bigg[\Big(\sum_{j=1}^n q_j\Big)^2 + \sum_{i=1}^k\bigg(\sum_{l=1}^d \Big(\sum_{j=1}^n q_j w^{(2)}_i x_{j,l}\mathbb{I}_{j,i}\Big)^2 + \Big(\sum_{j=1}^n q_j w^{(2)}_i\mathbb{I}_{j,i}\Big)^2 + \Big(\sum_{j=1}^n q_j\,\sigma\big(x_j^\top w^{(1)}_i + b^{(1)}_i\big)\Big)^2\bigg)\Bigg]$$
$$\le \frac{1}{n}\Bigg[n + \sum_{i=1}^k\bigg(\big(w^{(2)}_i\big)^2\bigg(\sum_{l=1}^d \Big(\sum_{j=1}^n q_j x_{j,l}\mathbb{I}_{j,i}\Big)^2 + \Big(\sum_{j=1}^n q_j\mathbb{I}_{j,i}\Big)^2\bigg) + \Big(\sum_{j=1}^n q_j\,\big\|w^{(1)}_i\big\|\,\sigma\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)\Big)^2\bigg)\Bigg],$$
where in the inequality we used $\big(\sum_{j=1}^n u_j\big)^2 \le n$ for all $u \in \mathbb{S}^{n-1}$, and substituted
$$\bar{w}^{(1)}_i \triangleq \frac{w^{(1)}_i}{\big\|w^{(1)}_i\big\|}, \qquad \bar{b}^{(1)}_i \triangleq -\frac{b^{(1)}_i}{\big\|w^{(1)}_i\big\|}.$$
Let $\Theta(f)$ be the set of all implementations corresponding to $f$. Since substituting
$$w^{(1)}_i \to c_i^{-1} w^{(1)}_i, \qquad b^{(1)}_i \to c_i^{-1} b^{(1)}_i, \qquad w^{(2)}_i \to c_i\, w^{(2)}_i$$
does not affect the network's functionality $f$, we have
$$\min_{\theta\in\Theta(f)} \lambda_{\max}\!\left(\nabla^2_\theta L\right) \le \min_{\theta\in\Theta(f),\,c_i^2>0} \frac{1}{n}\Bigg[n + \sum_{i=1}^k\Big(c_i^2\big(w^{(2)}_i\big)^2 A_i + c_i^{-2}\big\|w^{(1)}_i\big\|^2 B_i\Big)\Bigg],$$
where, for brevity, we denote
$$A_i \triangleq \sum_{l=1}^d \Big(\sum_{j=1}^n q_j x_{j,l}\mathbb{I}_{j,i}\Big)^2 + \Big(\sum_{j=1}^n q_j\mathbb{I}_{j,i}\Big)^2, \qquad B_i \triangleq \Big(\sum_{j=1}^n q_j\,\sigma\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)\Big)^2.$$
A necessary condition for optimality is that the derivative of the objective with respect to $c_i$ equals zero:
$$2c_i\big(w^{(2)}_i\big)^2 A_i - 2c_i^{-3}\big\|w^{(1)}_i\big\|^2 B_i = 0 \;\;\Rightarrow\;\; c_i^2 = \frac{\big\|w^{(1)}_i\big\|\sqrt{B_i}}{\big|w^{(2)}_i\big|\sqrt{A_i}}.$$
It is easy to verify that these solutions for $\{c_i\}$ are indeed global minima. Plugging this in, we get
$$\min_{\theta\in\Theta(f)} \lambda_{\max}\!\left(\nabla^2_\theta L\right) \le \min_{\theta\in\Theta(f)} \frac{1}{n}\Bigg[n + 2\sum_{i=1}^k \big\|w^{(1)}_i\big\|\big|w^{(2)}_i\big|\,\Big|\sum_{j=1}^n q_j\,\sigma\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)\Big|\sqrt{A_i}\Bigg].$$
Now, by applying the Cauchy-Schwarz inequality three times, we get
$$\Big|\sum_{j=1}^n q_j\,\sigma\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)\Big| \le \|q\|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)},$$
$$\Big(\sum_{j=1}^n q_j x_{j,l}\mathbb{I}_{j,i}\Big)^2 \le \|q\|^2\sum_{j=1}^n x_{j,l}^2\mathbb{I}_{j,i}, \qquad \Big(\sum_{j=1}^n q_j\mathbb{I}_{j,i}\Big)^2 \le \|q\|^2\sum_{j=1}^n \mathbb{I}_{j,i}.$$
Since $\|q\| = 1$, the right-hand sides of these inequalities are independent of $q$.
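As a side remark, the per-neuron balancing step above, minimizing $c^2 A + c^{-2} B$ over the rescaling factor $c$, can be verified numerically: by the AM-GM inequality the minimum value is $2\sqrt{AB}$, attained at $c^2 = \sqrt{B/A}$. A minimal sketch with arbitrary illustrative constants $A$ and $B$:

```python
import numpy as np

# Per-neuron objective under the rescaling c: h(c) = c^2 * A + c^(-2) * B, A, B > 0.
A, B = 3.7, 0.9                              # arbitrary positive constants for illustration
c = np.linspace(0.05, 5.0, 100_000)
h = c**2 * A + c**-2 * B

c2_opt = np.sqrt(B / A)                      # optimal squared rescaling factor
assert abs(c[np.argmin(h)]**2 - c2_opt) < 1e-2
assert abs(h.min() - 2 * np.sqrt(A * B)) < 1e-6   # minimum value 2*sqrt(A*B)
```
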
Thus, using these inequalities to further upper bound the top eigenvalue, we have
$$\min_{\theta\in\Theta(f)} \lambda_{\max}\!\left(\nabla^2_\theta L\right) \le 1 + \frac{2}{n}\min_{\theta\in\Theta(f)}\sum_{i=1}^k \big\|w^{(1)}_i\big\|\big|w^{(2)}_i\big|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2 + 1\big)\mathbb{I}_{j,i}}.$$
To continue upper bounding the sharpness ($\lambda_{\max}(\nabla^2 L)$) of the flattest implementation, we can consider some particular implementation of $f$. Specifically, since $f \in \mathcal{F}_k$, it can be represented as
$$f(x) = \sum_{i=1}^{k'} a_i\,\sigma\big(v_i^\top x - b_i\big) + \beta x^\top h + c,$$
where $\|v_i\| = 1$ for all $i$, $\|h\| = 1$, $a_i \ne 0$ for all $i$, $(v_i, b_i) \ne \pm(v_j, b_j)$ for all $i \ne j$ (i.e., the knots of all ReLU units are distinct), and $k' \le k$. Thus, we use the following implementation of $f$:
$$w^{(1)}_i = v_i, \qquad b^{(1)}_i = -b_i, \qquad w^{(2)}_i = a_i, \qquad b^{(2)} = c + \beta\tau, \qquad \text{for } i \in [k'],$$
where $\tau = \frac{1}{n}\sum_{j=1}^n h^\top x_j$. Additionally, if needed, we add two ReLU neurons to implement the linear component:
$$w^{(1)}_{k'+1} = h, \quad b^{(1)}_{k'+1} = -\tau, \quad w^{(2)}_{k'+1} = \beta, \qquad w^{(1)}_{k'+2} = -h, \quad b^{(1)}_{k'+2} = \tau, \quad w^{(2)}_{k'+2} = -\beta.$$
Thus,
$$1 + \frac{2}{n}\min_{\theta\in\Theta(f)}\sum_{i=1}^k \big\|w^{(1)}_i\big\|\big|w^{(2)}_i\big|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top \bar{w}^{(1)}_i - \bar{b}^{(1)}_i\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,i}}$$
$$\le 1 + \frac{2}{n}\sum_{i=1}^{k'} |a_i|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top v_i - b_i\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,i}}$$
$$\quad + \frac{2}{n}\Bigg[|\beta|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top h - \tau\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,k'+1}} + |\beta|\sqrt{\sum_{j=1}^n \sigma^2\big(-x_j^\top h + \tau\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,k'+2}}\Bigg]$$
$$\le 1 + \frac{2}{n}\sum_{i=1}^{k'} |a_i|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top v_i - b_i\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,i}} + \frac{4|\beta|}{n}\sqrt{\sum_{j=1}^n \big(x_j^\top h - \tau\big)^2}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)},$$
where in the last inequality we used $\max\big\{\sigma^2(x_j^\top h - \tau),\, \sigma^2(-x_j^\top h + \tau)\big\} \le (x_j^\top h - \tau)^2$ together with $\|x_j\|^2 + 1 > 0$, so that removing the indicator terms only increases the RHS of (110). Recall that we denoted by $\mathcal{C}_i \subseteq \{x_j\}$ the set of training points for which the $i$th neuron is active, and $n_i = |\mathcal{C}_i|$.
Then,
$$\frac{2}{n}\sum_{i=1}^{k'} |a_i|\sqrt{\sum_{j=1}^n \sigma^2\big(x_j^\top v_i - b_i\big)}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2+1\big)\mathbb{I}_{j,i}} = 2\sum_{i=1}^{k'} |a_i|\,\frac{n_i}{n}\sqrt{\frac{1}{n_i}\sum_{j\in\mathcal{C}_i}\big(x_j^\top v_i - b_i\big)^2}\sqrt{\frac{1}{n_i}\sum_{j\in\mathcal{C}_i}\big(\|x_j\|^2 + 1\big)}$$
$$= 2\sum_{i=1}^{k'} |a_i|\,\mathbb{P}\big(x^\top v_i > b_i\big)\sqrt{\mathbb{E}\big[(x^\top v_i - b_i)^2 \,\big|\, x^\top v_i > b_i\big]}\sqrt{\mathbb{E}\big[1 + \|x\|^2 \,\big|\, x^\top v_i > b_i\big]}. \qquad (112)$$
Additionally,
$$\frac{4|\beta|}{n}\sqrt{\sum_{j=1}^n \big(x_j^\top h - \tau\big)^2}\sqrt{\sum_{j=1}^n \big(\|x_j\|^2 + 1\big)} = 4|\beta|\sqrt{\mathrm{Var}\big(x^\top h\big)}\sqrt{1 + \mathbb{E}\big[\|x\|^2\big]}. \qquad (113)$$
Define
$$\hat{g}(v, b) = \mathbb{P}\big(x^\top v > b\big)\sqrt{\mathbb{E}\big[(x^\top v - b)^2 \,\big|\, x^\top v > b\big]}\sqrt{1 + \mathbb{E}\big[\|x\|^2 \,\big|\, x^\top v > b\big]}.$$
Then, we have
$$\min_{\theta\in\Theta(f)} \lambda_{\max}\!\left(\nabla^2_\theta L\right) \le 1 + 2\sum_{i=1}^{k'} |a_i|\,\hat{g}(v_i, b_i) + 4|\beta|\sqrt{\mathrm{Var}\big(x^\top h\big)}\sqrt{1 + \mathbb{E}\big[\|x\|^2\big]}$$
$$= 1 + 2\int_{\mathbb{S}^{d-1}\times\mathbb{R}} \big|(\mathcal{R}^*)^{-1}\Delta f(v, b)\big|\,\hat{g}(v, b)\,\mathrm{d}s(v)\,\mathrm{d}b + 4|\beta|\sqrt{\mathrm{Var}\big(x^\top h\big)}\sqrt{1 + \mathbb{E}\big[\|x\|^2\big]}$$
$$= 1 + 2\|f\|_{\mathcal{R},\hat{g}} + 4|\beta|\sqrt{\mathrm{Var}\big(x^\top h\big)}\sqrt{1 + \mathbb{E}\big[\|x\|^2\big]}. \qquad (115)$$
Note that $\mathrm{Var}(x^\top h) \le \lambda_{\max}(\Sigma_x)$, where $\Sigma_x$ is the covariance matrix of $x$. Additionally,
$$\|\nabla f(x)\| = \bigg\|\sum_{i=1}^{k'} a_i\,\mathbb{1}_{v_i^\top x - b_i > 0}\,v_i + \beta h\bigg\| \ge |\beta|\,\|h\| - \bigg\|\sum_{i=1}^{k'} a_i\,\mathbb{1}_{v_i^\top x - b_i > 0}\,v_i\bigg\| \ge |\beta| - \sum_{i=1}^{k'} |a_i|\,\|v_i\|$$
$$= |\beta| - \sum_{i=1}^{k'} |a_i| = |\beta| - \int_{\mathbb{S}^{d-1}\times\mathbb{R}} \big|(\mathcal{R}^*)^{-1}\Delta f\big|\,\mathrm{d}s(v)\,\mathrm{d}b = |\beta| - \|f\|_{\mathcal{R}}.$$
Therefore, for any $x \in \mathbb{R}^d$,
$$|\beta| \le \|\nabla f(x)\| + \|f\|_{\mathcal{R}}.$$
Taking the tightest bound, we obtain
$$|\beta| \le \|f\|_{\mathcal{R}} + \inf_{x\in\mathbb{R}^d} \|\nabla f(x)\|. \qquad (119)$$
Overall, combining (115), (116), and (119), we obtain
$$\min_{\theta\in\Theta(f)} \lambda_{\max}\!\left(\nabla^2_\theta L\right) \le 1 + 2\|f\|_{\mathcal{R},\hat{g}} + 4\Big(\|f\|_{\mathcal{R}} + \inf_{x\in\mathbb{R}^d}\|\nabla f(x)\|\Big)\sqrt{\lambda_{\max}(\Sigma_x)}\sqrt{1 + \mathbb{E}\big[\|x\|^2\big]}. \qquad (120)$$

L PROOF OF PROPOSITION 3

Let $f \in W^{d+1,1}_w(\mathbb{R}^d)$. First, we show that this implies both $\|f\|_{\mathcal{R}}$ and $\|f\|_{\mathcal{R},\hat{g}}$ are finite. Since we assume $d$ is odd, $(-\Delta)^{(d+1)/2} f$ is an integral power of the negative Laplacian applied to $f$, hence it can be expanded as a linear combination of order-$(d+1)$ partial derivatives, and so
$$\big\|(-\Delta)^{(d+1)/2} f\big\|_{1,w} \le a_d\sum_{|\beta| = d+1} \big\|\partial^\beta f\big\|_{1,w} \le a_d\,\|f\|_{W^{d+1,1}_w(\mathbb{R}^d)},$$
where $a_d$ is a constant depending on $d$ but independent of $f$. Therefore, $\|(-\Delta)^{(d+1)/2} f\|_{1,w}$ is finite. In particular, this shows $(-\Delta)^{(d+1)/2} f \in L^1(\mathbb{R}^d)$, and so $\mathcal{R}(-\Delta)^{(d+1)/2} f$ exists in a classical sense. This implies we have the formulas $\|f\|_{\mathcal{R}} = \gamma_d\,\|\mathcal{R}(-\Delta)^{(d+1)/2} f\|_1$ and $\|f\|_{\mathcal{R},\hat{g}} = \gamma_d\,\|\hat{g}\cdot\mathcal{R}(-\Delta)^{(d+1)/2} f\|_1$, where $\gamma_d = \frac{1}{2(2\pi)^{d-1}}$ (see (Ongie et al., 2020, Prop. 1)).

Recall that $w(x) = \mathcal{R}^*[1 + |b|](x) = c_d + \zeta_d\|x\|$, with $c_d = \int_{\mathbb{S}^{d-1}}\mathrm{d}v$ and $\zeta_d = \int_{\mathbb{S}^{d-1}}|v_1|\,\mathrm{d}v$. Therefore, we have
$$\big\|(-\Delta)^{(d+1)/2} f\big\|_{1,w} = \int_{\mathbb{R}^d}\big|(-\Delta)^{(d+1)/2} f(x)\big|\,w(x)\,\mathrm{d}x = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} \mathcal{R}\big[\big|(-\Delta)^{(d+1)/2} f\big|\big](v, b)\,(1 + |b|)\,\mathrm{d}s(v)\,\mathrm{d}b$$
$$\ge \int_{\mathbb{S}^{d-1}\times\mathbb{R}} \big|\mathcal{R}(-\Delta)^{(d+1)/2} f(v, b)\big|\,(1 + |b|)\,\mathrm{d}s(v)\,\mathrm{d}b = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} \big|\mathcal{R}(-\Delta)^{(d+1)/2} f\big|\,\frac{1 + |b|}{1 + \hat{g}(v, b)}\,\big(1 + \hat{g}(v, b)\big)\,\mathrm{d}s(v)\,\mathrm{d}b$$
$$\ge C_{\hat{g}}\int_{\mathbb{S}^{d-1}\times\mathbb{R}} \big|\mathcal{R}(-\Delta)^{(d+1)/2} f\big|\,\big(1 + \hat{g}(v, b)\big)\,\mathrm{d}s(v)\,\mathrm{d}b = \gamma_d^{-1} C_{\hat{g}}\big(\|f\|_{\mathcal{R}} + \|f\|_{\mathcal{R},\hat{g}}\big),$$
where $C_{\hat{g}} = \inf_{(v,b)\in\mathbb{S}^{d-1}\times\mathbb{R}} \frac{1 + |b|}{1 + \hat{g}(v, b)}$.

Let $\alpha = (\mathcal{R}^*)^{-1}\Delta f = -\gamma_d\,\mathcal{R}(-\Delta)^{(d+1)/2} f$. Then $\|f\|_{\mathcal{R}} = \|\alpha\|_1$ is finite, and so $\alpha$ is an $L^1$ function, which can be identified with a finite signed measure. Since $\|f\|_{\mathcal{R},\hat{g}} = \|\hat{g}\cdot\alpha\|_1$ is also finite, we see that $\bar\alpha(v, b) := (1 + \hat{g}(v, b))\,\alpha(v, b)$ is also an $L^1$ function, which can be identified with a finite signed measure. By (Malliavin et al., 1995, Thm. 6.9), this implies there exists a sequence of finite atomic measures $\{\bar\alpha_k\}$, each consisting of a sum of at most $k$ Diracs, converging narrowly to $\bar\alpha$ with $\|\bar\alpha_k\|_{\mathrm{TV}} \le \|\bar\alpha\|_1$. Define $\alpha_k(v, b) = \bar\alpha_k(v, b)/(1 + \hat{g}(v, b))$, which is also an atomic measure. Then it is easy to show $\alpha_k \to \alpha$ narrowly as well. By Lemma 5 of (Ongie et al., 2020), this implies there exists a sequence of single hidden-layer ReLU networks $f_k \in \mathcal{F}_k$ converging to $f$ pointwise. Therefore, for all $k$ we have
$$\|f_k\|_{\mathcal{R}} + \|f_k\|_{\mathcal{R},\hat{g}} = \|f_k\|_{\mathcal{R},1+\hat{g}} = \|\bar\alpha_k\|_{\mathrm{TV}} \le \|\bar\alpha\|_1 = \|f\|_{\mathcal{R},1+\hat{g}} = \|f\|_{\mathcal{R}} + \|f\|_{\mathcal{R},\hat{g}}. \qquad (123)$$
Combining this inequality with the bound on $\|f\|_{\mathcal{R}} + \|f\|_{\mathcal{R},\hat{g}}$ given above, we see that
$$\|f_k\|_{\mathcal{R}} + \|f_k\|_{\mathcal{R},\hat{g}} \le \|f\|_{\mathcal{R}} + \|f\|_{\mathcal{R},\hat{g}} \le c_{d,\hat{g}}\,\|f\|_{W^{d+1,1}_w(\mathbb{R}^d)},$$
where $c_{d,\hat{g}} = a_d\,\gamma_d\,C^{-1}_{\hat{g}}$ is a constant defined independently of $f$.
Finally, if $K$ is any compact subset, the pointwise convergence of $f_k$ to $f$ on $K$ can be upgraded to $L^1$-convergence using Lebesgue's dominated convergence theorem: by the bound on the Lipschitz constant of a function given in terms of the $\mathcal{R}$-norm in Proposition 8 of (Ongie et al., 2020), we have $|f_k(x)| \le \|x\|(C + \|f_k\|_{\mathcal{R}}) \le B\|x\|$ for some constants $C, B \ge 0$, and since $x \mapsto B\|x\|$ is $L^1$-integrable over any compact subset, the hypotheses of Lebesgue's dominated convergence theorem hold.

M STABILITY NORM OF "PYRAMID" FUNCTION

Here, to provide a better understanding of the depth separation result in Proposition 2, we show by direct calculation that the "pyramid" function in $d = 2$ dimensions, given by
$$p(x) = p(x_1, x_2) = \big[1 - |x_1| - |x_2|\big]_+,$$
fails to have finite stability norm. In particular, we explicitly compute $(\mathcal{R}^*)^{-1}\Delta p$ as a tempered distribution and show it is not a finite measure (i.e., it must be a distribution of order $> 0$), which implies it cannot have finite stability norm under the assumptions in Proposition 2. First, observe that $\Delta p$ is a linear combination of Diracs supported on the finite line segments $\ell_k$ defining the "edges" of the pyramid:
$$\Delta p(x) = \sum_k c_k\,\delta_{\ell_k}.$$
This means that pairing $\Delta p$ with any Schwartz-class test function reduces to a sum of line integrals, showing that $\Delta p$ is a finite measure; see (128). Finally, by linearity of the operator $K\mathcal{R}$, we have $(\mathcal{R}^*)^{-1}\Delta p = K\mathcal{R}\Delta p = \sum_k c_k\,K\mathcal{R}\delta_{\ell_k}$. Thus,
$$\big[(\mathcal{R}^*)^{-1}\Delta p\big](v(\theta), b) = \sum_k \frac{c_k}{|\sin(\theta - \theta_k)|}\bigg(\mathrm{p.v.}\frac{1}{b - \alpha_k(\theta)} - \mathrm{p.v.}\frac{1}{b - \beta_k(\theta)}\bigg) + \sum_k c_k\,|\ell_k|\,\delta(\theta - \theta_k)\cdot\mathrm{p.v.}\frac{-1}{(b - b_k)^2}. \qquad (139)$$
See Figure 7 for an approximate plot of $(\mathcal{R}^*)^{-1}\Delta p = K\mathcal{R}\Delta p$. As evidenced by the plot, this density has singularities along a 1-D manifold $S$ in the Radon domain. This set corresponds to all lines in the primal domain passing through the corners of the pyramid. Finally, we show that $\alpha := (\mathcal{R}^*)^{-1}\Delta p$ is not a finite measure (i.e., it is not an order-zero distribution).
Intuitively, this is because the "density" $\alpha(v, b)$ is not absolutely integrable, since every 1-D angular slice has singularities like $1/|b|$. Below we prove this more formally. To prove $\alpha$ cannot be an order-zero distribution, we construct a family of uniformly bounded test functions $\{\varphi_\epsilon\}_{\epsilon>0}$ such that $|\langle\alpha, \varphi_\epsilon\rangle| \ge \rho(\epsilon)\,\|\varphi_\epsilon\|_\infty$, where $\rho(\epsilon)$ is a function such that $\rho(\epsilon) \to +\infty$ as $\epsilon \to 0^+$. Let $\gamma > 0$ be a small fixed constant less than one. For every $0 < \epsilon < \gamma$, consider the "rainbow-shaped" subset $\Omega_\epsilon$ of the Radon domain defined by the inequalities $-\gamma/2 < \theta < \gamma/2$ and $\cos(\theta) - \epsilon < b < \cos(\theta)$. In the primal domain, this set corresponds to a collection of lines that nearly intersect the corner point $(1, 0)$. Only three terms in the sum making up $(\mathcal{R}^*)^{-1}\Delta p$ in (139) are dominant in the region $\Omega_\epsilon$, corresponding to the three line segments in the support of $\Delta p$ that arise from the right-most corner of the pyramid. Elementary calculations show these three terms are specified by the parameters
$$c_1 = -2, \quad \theta_1 = \pi/2, \quad \beta_1(\theta) = \cos(\theta),$$
together with those in (140), where we omit the terms $\mathrm{p.v.}\frac{1}{b - \alpha_k(\theta)}$ and $\mathrm{p.v.}\frac{-1}{(b - b_k)^2}$, since points in $\Omega_\epsilon$ are far from their singularity set. In particular, we can show that $\alpha - \bar\alpha$ is an order-zero distribution when restricted to $\Omega_\epsilon$ (i.e., all other terms are locally smooth and bounded). Let $g(\theta)$ be the function of $\theta$ in front of the principal value in $\bar\alpha$. Note that $g(\theta) > B > 0$ for all $\theta \in \Omega_\epsilon$, where the constant $B$ is independent of $\epsilon$. Let $\varphi_\epsilon(\theta, b)$ be a smooth function supported in $\Omega_\epsilon$ such that $0 \le \varphi_\epsilon(\theta, b) \le 1$ and $\varphi_\epsilon(\theta, b) = 1$ on the region defined by the inequalities $-\gamma/2 < \theta < \gamma/2$ and $\cos(\theta) - \epsilon \le b \le \cos(\theta) - \epsilon^2$. Then for any fixed $\theta \in (-\gamma/2, \gamma/2)$, the integral $\mathrm{p.v.}\int \frac{\varphi_\epsilon(\theta, b)}{\cos(\theta) - b}\,\mathrm{d}b$ is bounded below by $\log(\epsilon^{-1})$; see (142). Since $\|\varphi_\epsilon\|_\infty = 1$ and $\gamma B\log(\epsilon^{-1}) - C \to +\infty$ as $\epsilon \to 0^+$, this shows that $\alpha$ cannot be a distribution of order zero, i.e., it cannot be identified with a finite measure.
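The logarithmic blow-up driving this argument is easy to check numerically: the integral of $1/(\cos\theta - b)$ over the band $\cos\theta - \epsilon \le b \le \cos\theta - \epsilon^2$ equals $\log(\epsilon^{-1})$ exactly, since $\int_{c-\epsilon}^{c-\epsilon^2}\frac{\mathrm{d}b}{c - b} = \log(\epsilon) - \log(\epsilon^2)$. A small sketch (Python/NumPy):

```python
import numpy as np

def band_integral(eps, c=1.0, n=2_000_000):
    """Numerically integrate 1/(c - b) over the band [c - eps, c - eps^2];
    the exact value is log(1/eps)."""
    b = np.linspace(c - eps, c - eps**2, n)
    return float(np.sum(1.0 / (c - b)) * (b[1] - b[0]))

for eps in (1e-1, 1e-2, 1e-3):
    assert abs(band_integral(eps) - np.log(1.0 / eps)) < 1e-2
```

As $\epsilon \to 0^+$ the band shrinks while the integral grows without bound, which is exactly why no finite measure can reproduce the pairings $\langle\alpha, \varphi_\epsilon\rangle$.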

N ADDITIONAL EXPERIMENTS

The experiments in Sec. 6 are designed to demonstrate Theorem 1 over a diverse range of step sizes ($[10^{-4}, 0.1]$). Since flat minima of the loss landscape are concentrated near the origin in parameter space, and training with a small step size near flat minima is inefficient, we used a large initialization (about 10 times larger than standard methods). Here we repeat the MNIST experiment using various initialization scales over a higher range of step sizes ($[10^{-3}, 0.2]$). Figure 8 presents the sharpness curves for the different scales. For large initializations, $\times 10$ and $\times 15$, we get the same behavior as depicted in Sec. 6. For small initializations, $\times 1$ and $\times 5$, the sharpness of the obtained solutions is fixed for small learning rates up to a critical step size $\eta^*$. At this threshold, the sharpness equals $2/\eta^*$, and any increment in the step size makes the minimum unstable. This pushes SGD to flatter minima for larger step sizes, ones that satisfy the stability criterion. It is important to note that for the standard initialization, shown in Fig. 8(a), the threshold occurs well before the standard step size of $\eta = 0.1$. Namely, this phenomenon occurs with a standard initialization and a standard learning rate.



Footnotes:
1. In a slight abuse of terms, in this paper we say a function is 'smooth' if some weighted $L^1$ norm of its second derivative is bounded.
2. A 'knot' is a boundary between two pieces (i.e., an intersection between hyperplanes). See Fig. 2 for an illustration.
3. We focus on the MSE loss for simplicity, but the results can be extended to other loss functions; see App. J.
4. Which is true, for example, when the training set is finite.
5. This integral should be interpreted in the distributional sense (see App. G for details).
6. Note that for GD, $\eta < 2/\lambda_{\max}$ is a necessary and sufficient condition for linear stability.
7. Specifically, for all positive integers $k$, the function $v \mapsto \int_{b\in\mathbb{R}} \phi(v, b)\,b^k\,\mathrm{d}b$ needs to be a homogeneous polynomial in $v$ of degree $k$.
8. If $(v_i, b_i) = (v_j, b_j)$ then one of the neurons is redundant. If $(v_i, b_i) = -(v_j, b_j)$ then these units can be combined into an affine function, which we "absorb" into the term $x^\top q + c$.
9. See Proposition 3.5.1 in (Epstein, 2007).
10. Specifically, any $\eta$ such that $\eta \le 2/\lambda_{\max}(\nabla^2 L(\theta^*))$ and $\eta \ne 1/\lambda_i$ for all $i$ satisfies this condition.
11. This set corresponds to multiplying the weights of corresponding neurons within different layers by positive factors whose product is 1.
12. If $(v_i, b_i) = (v_j, b_j)$ then one of the neurons is redundant. If $(v_i, b_i) = -(v_j, b_j)$ then these units can be combined into an affine function, which we "absorb" into the term $\alpha x^\top h + c$.
13. Namely, $\langle\alpha_k, \varphi\rangle \to \langle\alpha, \varphi\rangle$ for all continuous and bounded functions $\varphi: \mathbb{S}^{d-1}\times\mathbb{R} \to \mathbb{R}$.



Figure 1: Larger step size leads to smoother prediction function. We train a single hidden-layer ReLU network on a regression task with two-dimensional data, depicted by red points. The different panels show the predictor function f obtained when training with different step sizes.

Figure 2: Illustration of the stability norm. Panel (a) depicts an interpolating function f . Panel (b) displays the absolute value of the Laplacian of f , i.e., |∆f |. Here the color codes the amplitude of the delta functions. Panel (c) presents the weight function ρ. The stability norm is the weighted sum of line integrals of ρ, according to |∆f |.

Figure 3: Visualization of g of Thm. 1 and ρ of (13) for two toy examples. (a), (b) Two data points x 1 = (1, 0) and x 2 = (-1, 0). (c),(d) Two dimensional Gaussian data, i.e., X ∼ N (0, I).

Figure4: Validating the bounds on synthetic data. We trained a two-layer ReLU network on a regression task with synthetic data using GD (see Sec. 6). Panel (a) depicts the sharpness of the minima to which GD converged, as a function of the step size η. As η increases, the minima get flatter in parameter space (yellow curve), which translates to smoother predictors in function space (purple curve). Panel (b) shows the norm of the bias vector b as a function of the step size. Here we see that the bias vector grows with the step size, as the predictor function gets smoother.

Figure 5: Validating the bounds on MNIST. We trained a single hidden-layer ReLU network for binary classification on two classes from MNIST using SGD (see Sec. 6). Panel (a) depicts the sharpness versus the step size η. Here as η increases, the minima get flatter in parameter space (yellow curve), which translates to smoother predictors in function space (purple curve). Panel (b) shows the performance on the validation set. Here the trained model generalizes better as the step size increases.

(d -1)-dimensional surface measure on the hyperplane v ⊤ x = b. Note that the Radon transform is an even function, i.e., Rf (v, b) = Rf (-v, -b), since (v, b) and (-v, -b) describe the same hyperplane.

On the other hand, treating $\Delta f$ as a measure defined over a compact subset of $\mathbb{R}^d$, its total variation measure $|\Delta f|$ is also equal to $|a|\,\delta(x^\top v - b)$. The following result shows that, more generally, when $f$ is any finite-width ReLU network, then $|\Delta f|_{\mathcal{R}}$ and $|\Delta f|$ are equal as measures. Proposition 4. Let $f \in \mathcal{F}_k$. Then $|\Delta f|_{\mathcal{R}} = |\Delta f|$ as measures defined over any compact subset of $\mathbb{R}^d$.

The constant $C_{\hat{g}} = \inf_{(v,b)\in\mathbb{S}^{d-1}\times\mathbb{R}} \frac{1+|b|}{1+\hat{g}(v,b)}$ is finite and non-zero because for all $v \in \mathbb{S}^{d-1}$ we have $\hat{g}(v, b) = O(|b|)$ as $|b| \to \infty$, where the implied constant is independent of $v$. Therefore, we have shown that $\|f\|_{\mathcal{R}}$ and $\|f\|_{\mathcal{R},\hat{g}}$ are finite as claimed.

In particular, $\Delta p(x)$ is a finite measure (i.e., a distribution of order zero), since
$$|\langle\Delta p(x), \phi\rangle| \le \sum_k |c_k|\,|\ell_k|\,\|\phi\|_\infty, \qquad (128)$$
where $|\ell_k|$ is the length of the line segment $\ell_k$. See Fig. 6 below for an illustration.

Figure 6: Visualizations of the pyramid function p and its Laplacian -∆p.

Figure 7: Visualizations of R∆p and (R * ) -1 ∆p.

$$c_2 = \sqrt{2}, \quad \theta_2 = \pi/4, \quad \beta_2(\theta) = \cos(\theta), \qquad c_3 = \sqrt{2}, \quad \theta_3 = -\pi/4, \quad \beta_3(\theta) = \cos(\theta). \qquad (140)$$
Therefore, $\alpha$ is well-approximated on $\Omega_\epsilon$ by
$$\bar\alpha = \bigg(\frac{\sqrt{2}}{|\sin(\theta - \pi/4)|} + \frac{\sqrt{2}}{|\sin(\theta + \pi/4)|} - \frac{2}{|\sin(\theta - \pi/2)|}\bigg)\,\mathrm{p.v.}\frac{1}{\cos(\theta) - b},$$

$$\mathrm{p.v.}\int \frac{\varphi_\epsilon(\theta, b)}{\cos(\theta) - b}\,\mathrm{d}b \ge \int_{\cos(\theta)-\epsilon}^{\cos(\theta)-\epsilon^2} \frac{\mathrm{d}b}{\cos(\theta) - b} = \log(\epsilon^{-1}).$$
Therefore, we have
$$|\langle\alpha, \varphi_\epsilon\rangle| \ge |\langle\bar\alpha, \varphi_\epsilon\rangle| - |\langle\alpha - \bar\alpha, \varphi_\epsilon\rangle| \ge \int_{-\gamma/2}^{\gamma/2} \langle\bar\alpha_\theta, \varphi_\epsilon(\theta, \cdot)\rangle\,\mathrm{d}\theta - C\,\|\varphi_\epsilon\|_\infty \ge \big(\gamma B\log(\epsilon^{-1}) - C\big)\,\|\varphi_\epsilon\|_\infty. \qquad (142)$$

Figure8: Sharpness vs. step size for different initialization scales. We trained a single ReLU network for binary classification on two classes from MNIST using SGD (see Sec. 6 for details). Specifically, we initialized the network using different scales, and for each scale we trained the network using multiple step sizes. We see that as η increases, the minima get flatter in parameter space (yellow curve), which translates to smoother predictors in function space (purple curve).

ACKNOWLEDGMENTS

The research of RM was supported by the Planning and Budgeting Committee of the Israeli Council for Higher Education, and by the Andrew and Erna Finci Viterbi Graduate Fellowship. GO was supported by NSF CRII award CCF-2153371. The research of DS was funded by the European Union (ERC, A-B-C-Deep, 101039436). Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency (ERCEA). Neither the European Union nor the granting authority can be held responsible for them. DS also acknowledges the support of Schmidt Career Advancement Chair in AI. TM was supported by grant 2318/22 from the Israel Science Foundation and by the Ollendorff Center of the Viterbi Faculty of Electrical and Computer Engineering at the Technion.


Suppose $\ell$ is a vertical line segment, $\ell = \{(0, t) : t \in [c, d]\}$. Assuming $v$ is such that $v_2 \ne 0$, the inner integral above, with $\ell$ in place of $\ell_k$, simplifies as
$$\mathcal{R}\delta_\ell(v, b) = \int_c^d \delta(v_2 t - b)\,\mathrm{d}t = \frac{1}{|v_2|}\,\mathbb{1}\{b/v_2 \in [c, d]\}.$$
In the event that $v_2 = 0$ we have
$$\mathcal{R}\delta_\ell(v, b) = (d - c)\,\delta(b).$$
Therefore, we have shown that each slice $\mathcal{R}\delta_\ell(v, \cdot)$ is a weighted indicator function when $v_2 \ne 0$, and a weighted Dirac when $v_2 = 0$. Now, consider one of the line segments $\ell_k$ coinciding with the edges of the pyramid. Such a segment is a rotation of the vertical line segment $\ell$ through an angle $\theta_k$, followed by a translation by $b_k v_k$. Therefore, by properties of Radon transforms, $\mathcal{R}\delta_{\ell_k}$ is obtained from $\mathcal{R}\delta_\ell$ by the corresponding rotation and shift in the Radon domain, where we set $v(\theta) = [\cos(\theta), \sin(\theta)]$ for all $\theta \in [0, \pi)$. More concretely, we can express every slice $\mathcal{R}\delta_{\ell_k}(v, \cdot)$ either as a weighted indicator function when $v \ne \pm v_k$, which is non-zero when $b$ is such that the line $L_{v,b} := \{x : v^\top x = b\}$ intersects the line segment $\ell_k$, or as a weighted Dirac when $v = \pm v_k$, i.e.,
$$\mathcal{R}\delta_{\ell_k}(v(\theta), b) = \frac{1}{|\sin(\theta - \theta_k)|}\,\mathbb{1}\{\alpha_k(\theta) \le b \le \beta_k(\theta)\} \;\text{ for } \theta \ne \theta_k, \qquad \mathcal{R}\delta_{\ell_k}(v(\theta_k), b) = |\ell_k|\,\delta(b - b_k),$$
for some $\alpha_k(\theta)$ and $\beta_k(\theta)$ that vary continuously with $\theta$. Finally, by linearity, we obtain $\mathcal{R}\Delta p = \sum_k c_k\,\mathcal{R}\delta_{\ell_k}$. See Figure 7 for an approximate plot of $\mathcal{R}\Delta p$.

Now we compute $(\mathcal{R}^*)^{-1}\Delta p$. Recall that $(\mathcal{R}^*)^{-1} = K\mathcal{R}$, where $K = H\partial_b$ is a filtering step, with $H$ being the Hilbert transform applied separably in the $b$-variable (Helgason, 1999). Applied to a function with a jump discontinuity of height $h$ at $b = b_0$, $K$ produces (up to constants) the principal-value singularity $h\,\mathrm{p.v.}\frac{1}{b - b_0}$, where p.v. indicates a principal value integral. Therefore, for any $\theta \ne \theta_k$, applying $K$ to the indicator slices yields the pair of principal-value terms in (139), and for $\theta = \theta_k$ we obtain the term $|\ell_k|\,\delta(\theta - \theta_k)\cdot\mathrm{p.v.}\frac{-1}{(b - b_k)^2}$. (137)

