DEEP NETWORKS AND THE MULTIPLE MANIFOLD PROBLEM

Abstract

We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth L is large relative to certain geometric and statistical properties of the data, the network width n grows as a sufficiently large polynomial in L, and the number of i.i.d. samples from the manifolds is polynomial in L, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. The argument centers around the "neural tangent kernel" of Jacot et al. and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected ReLU networks, requiring width n ≳ L poly(d_0) to achieve uniform concentration of the initial kernel over a d_0-dimensional submanifold of the unit sphere S^{n_0-1}, and a nonasymptotic framework for establishing generalization of networks trained in the "NTK regime" with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network, and this approach should be of use in establishing similar results for other network architectures.

1. INTRODUCTION

Data in many applications in machine learning and computer vision exhibit low-dimensional structure (Fig. 1a). Although deep neural networks achieve state-of-the-art performance on tasks in these areas, rigorous explanations for their performance remain elusive, in part due to the complex interaction between models, architectures, data, and algorithms in neural network training. There is a need for model problems that capture essential features of applications (such as low dimensionality), but are simple enough to admit rigorous end-to-end performance guarantees. In addition to helping to elucidate the mechanisms by which deep networks succeed, this approach has the potential to clarify the roles of various network properties and how these should reflect the properties of the data. These considerations lead us to formulate the multiple manifold problem (Fig. 1b), a binary classification problem in which the classes are two disjoint submanifolds of the unit sphere S^{n_0-1}, and the classifier is a deep fully-connected ReLU network of depth L and width n trained on N i.i.d. samples from a distribution supported on the manifolds. The goal is to articulate conditions on the network architecture and number of samples under which the learned classifier provably separates the two manifolds, guaranteeing perfect generalization to unseen data. The difficulty of an instance of the multiple manifold problem is controlled by the dimension of the manifolds d_0, their separation ∆, and their curvature κ, allowing us to study the constraints imposed by these intrinsic properties of the data on the settings of the neural network's architectural hyperparameters such that the two manifolds can be separated by training with a gradient-based method.
Figure 1: (a) Data in image classification with standard augmentation techniques, as well as in other domains in which neural networks are commonly used, lies on low-dimensional class manifolds, in this case those generated by the action of continuous transformations on images in the training set. Tangent vectors at a point on the manifold corresponding to an application of a rotation or a translation are illustrated in green. The dimension of the manifold is determined by the dimension of the symmetry group, and is typically small. (b) The multiple manifold problem. Our model problem, capturing this low-dimensional structure, is the classification of low-dimensional submanifolds of a sphere S^{n_0-1}. The difficulty of the problem is set by the inter-manifold separation ∆ and the curvature κ; these parameters in turn set the depth and width of the network required to provably reduce the generalization error efficiently.

Our main result is an analysis of the one-dimensional case of the multiple manifold problem, which reduces the analysis of the gradient descent dynamics to the construction of a certificate: a demonstration that a certain deterministic integral equation involving the network architecture and the structure of the data admits a solution of small norm. We construct such a certificate for the simple geometry in Fig. 3, guaranteeing generalization in this setting.

Theorem 1 (informal). Let d_0 = 1. Suppose a certificate for M exists. Then if the network depth satisfies L ≥ poly(κ, C_ρ, log(n_0)), the width satisfies n ≥ poly(L, log(L n_0)), and the number of training samples satisfies N ≥ poly(L), randomly-initialized gradient descent on N i.i.d. samples rapidly learns a network that separates the two manifolds with overwhelming probability. The constants C_ρ, κ depend only on the data density and the regularity of the manifolds. In addition, if L ≳ ∆^{-1}, then a certificate exists for the configuration of M shown in Fig. 3.
Theorem 1 gives a provable generalization guarantee for a model classification problem with deep networks on structured data, depending only on the architectural hyperparameters and properties of the data. In addition, it provides an interpretable tradeoff between the architectural settings necessary to separate the two manifolds: the network depth needs to be set according to the intrinsic difficulty of the problem, and the network width needs to grow with the depth. Our analysis gives further insight into the independent roles played by each of these parameters in solving the problem, with the depth acting as a 'fitting resource', making the network's output more regular and easier to change, and the width acting as a 'statistical resource', granting concentration of the network, over the random initialization, around a well-behaved object that we can analyze. Moreover, the sample complexity of Theorem 1 is dictated by the intrinsic difficulty of the problem instance, which is set by the geometry of the data. As a consequence, we avoid any dependence of the width of the network on the number of samples, a dependence which is common in deep network convergence results in the literature (e.g. (Allen-Zhu et al., 2019b; Du et al., 2019), (Chen et al., 2021, Theorem 3.4)). As is the case in practice, given a fixed architecture, more data does not have a detrimental effect on fitting. Theorem 1 is modular, in the sense that a generalization guarantee is ensured for any geometry for which one can construct a certificate. The key to our approach will be to approximate the gradient descent dynamics with a linear discrete dynamical system defined in terms of the so-called neural tangent kernel Θ(x, x′) defined on the manifolds.
Due to the structure in the data, diagonalizing the operator corresponding to this kernel is intractable in general, but we show that constructing a certificate (arguably an easier task, because it requires producing a bound on the norm of a solution to an equation rather than producing the solution itself) suffices to guarantee that the error decreases rapidly during training given a suitably structured network. We summarize the primary contributions of this work below.

• Generalization in deep networks: There are few generalization results in the literature for deep networks trained efficiently with gradient descent. Theorem 1 provides such a guarantee that does not depend on any property of the trained network (e.g., norms of final weights) that is not readily available before training. In this context, the certificate condition is equivalent to the initial network function having a controlled norm in a certain RKHS; this condition is natural in the training regime we consider, and appears ubiquitously in works on generalization in shallower networks (Ghorbani et al., 2020; Ji & Telgarsky, 2020; Nitanda & Suzuki, 2021).

• Uniform concentration of the neural tangent kernel for deep ReLU networks: As an intermediate step in the proof of Theorem 1, we establish essentially optimal rates of uniform concentration for the neural tangent kernel of an arbitrarily deep network (Theorem 2) using martingale concentration, where we require the width to grow only linearly with the depth. We expect this martingale approach to be applicable to essentially any other compositionally-structured network architecture. Our uniform result generalizes prior results on pointwise concentration (Arora et al., 2019b; Allen-Zhu et al., 2019b), analogous to our Theorem B.3, and proves useful in establishing generalization.
• Strong regularity estimates for random ReLU networks: As a further consequence of the uniform concentration framework we have developed, we obtain depth-logarithmic Lipschitz estimates for random ReLU networks of arbitrary depth and linear width, as well as (for still wider networks) a uniform approximation for the network output by a constant which improves with depth, both with overwhelming probability (Section 3.3). We also control the evolution of the Lipschitz constant during NTK regime training (Lemma B.7), showing that it scales polynomially in the depth. These results may be of interest in applications where guaranteeing a Lipschitz property for networks is important, such as GAN training (Miyato et al., 2018) or denoising (Ryu et al., 2019; Sun et al., 2020) .

1.1. RELATED WORK

Deep networks and low-dimensional structure. The notion of modeling data as low-dimensional submanifolds has been widely considered in the context of clustering (Wang et al., 2015) and manifold learning (Donoho & Grimes, 2005; Fefferman et al., 2016). Goldt et al. (2020) independently proposed the "hidden manifold model", a model problem for learning shallow neural networks for binary classification of structured data, with motivations very similar to ours and which admits a mean-field analysis (Gerace et al., 2020). Their data model consists of Gaussian samples from a low-dimensional subspace passed through a nonlinear function acting coordinatewise in the standard basis; although this models statistical variations around a base domain, a feature of real data that the model we study here lacks, we believe that the study of an arbitrary density supported on two Riemannian manifolds lends our data model increased structural generality. In the context of kernel regression with the kernel given by the NTK of a two-layer neural network, Ghorbani et al. (2020) study a data generating model that consists of uniform samples from a low-dimensional subsphere corrupted additively by independent uniform samples from a subsphere in the orthogonal complement, and a target mapping that depends only on the low-dimensional part. The authors obtain asymptotic generalization guarantees for this data model that reveal conditions under which the corruption degrades the performance of neural tangent methods.

Analyses of neural network training.

To reason analytically about the complicated training process, we adopt the neural tangent kernel approach (Jacot et al., 2018). The first works to instantiate these ideas in a nonasymptotic setting obtained convergence guarantees for training deep neural networks on finite datasets (Allen-Zhu et al., 2019b; Du et al., 2019). By exploiting more structure in the data, generalization results have been obtained (Allen-Zhu et al., 2019a; Arora et al., 2019a; Ji & Telgarsky, 2020; Oymak et al., 2019; Cao & Gu, 2019; Suzuki, 2020; Li et al., 2020; Allen-Zhu & Li, 2020) that apply to shallow networks, teacher-student learning scenarios, and/or hold conditional on the existence of certain small-norm interpolators. Other works have obtained generalization guarantees using generalization bounds for kernel methods (Ghorbani et al., 2019; Liang et al., 2020; Ghorbani et al., 2020; Montanari & Zhong, 2020), using the fact that the linearized predictor in the NTK regime can be linked to a kernel method (Arora et al., 2019b). A parallel line of works (Mei et al., 2018; Tzen & Raginsky, 2020; Mei et al., 2019; Chizat & Bach, 2020; Fang et al., 2020) approaches the problem by studying an infinite-width limit of neural network training that yields a different training dynamics. Approaches of this type are of interest because there is no restriction to short-time dynamics, and the limit of the dynamics can often be characterized in terms of a well-structured object, such as a max-margin classifier (Chizat & Bach, 2020). On the other hand, it is often difficult to prove finite-time convergence to the limit.

2.1. DATA MODEL AND NETWORK DEFINITIONS

We consider data supported on the union of two class manifolds M = M_+ ∪ M_-, where M_+ and M_- are two disjoint smooth regular simple curves taking values in S^{n_0-1}, with n_0 ≥ 3. We denote the data measure supported on M that generates our samples as µ_∞, with corresponding density ρ, and write ρ_min = inf_{x∈M} ρ(x) and ρ_max = sup_{x∈M} ρ(x). We denote by κ a uniform bound on the (extrinsic) curvature of the two curves, and we write ∆ = min_{x∈M_+, x′∈M_-} ∠(x, x′) for the separation between class manifolds, where ∠(x, x′) = cos^{-1}⟨x, x′⟩ for unit vectors. To have a quantitative characterization of 'how simple' the curves are, we assume there exist constants 0 < c_λ ≤ 1, K_λ ≥ 1 such that for every 0 < s ≤ c_λ/κ and every x, x′ in a common connected component of M, one has dist_M(x, x′) ≤ K_λ s if ∠(x, x′) ≤ s, where dist_M denotes the Riemannian distance. Our target function is the signed indicator for each class manifold f⋆(x) = 1_{M_+}(x) - 1_{M_-}(x). The model we consider is a fully-connected neural network with ReLU activations and access to i.i.d. samples from µ_∞ and their corresponding labels. We parameterize our neural network with weights W^1 ∈ R^{n×n_0}, W^ℓ ∈ R^{n×n} for ℓ ∈ {2, …, L}, and W^{L+1} ∈ R^{1×n}, which we collect as θ = (W^1, …, W^{L+1}), and write the iterates of the forward pass as α^ℓ_θ(x) = [W^ℓ α^{ℓ-1}_θ(x)]_+ for ℓ = 1, …, L, with α^0_θ(x) = x. We also refer to these iterates as features or activations. The network output is written as f_θ(x) = W^{L+1} α^L_θ(x), and the prediction error as ζ_θ(x) = f_θ(x) - f⋆(x). For an i.i.d. sample (x_1, …, x_N) from µ_∞, we write µ_N = (1/N) Σ_{i=1}^N δ_{x_i} for the empirical measure associated to the sample, and we consider the training objective L_{µ_N}(θ) = (1/2) ∫_M (ζ_θ(x))^2 dµ_N(x), i.e. the empirical risk evaluated with the square loss. We train with vanilla gradient descent with constant step size τ > 0: after randomly initializing the parameters θ^N_0 with W^ℓ ∼ i.i.d. N(0, 2/n) for ℓ ∈ [L] and W^{L+1} ∼ i.i.d. N(0, 1), we consider the sequence of iterates θ^N_{k+1} = θ^N_k - τ ∇L_{µ_N}(θ^N_k), where ∇L_{µ_N} represents a 'formal gradient' of the empirical loss, which we define in detail in Appendix A.1. We say the parameters obtained at iteration k of gradient descent separate the manifolds M if the classifier implemented by the neural network with the parameters θ^N_k labels the two manifolds correctly, i.e. if f⋆(x) sign(f_{θ^N_k}(x)) = 1 for every x ∈ M. As a shorthand, we will denote quantities evaluated at θ^N_k with a subscript k; an omitted subscript will denote k = 0, and we will write explicitly θ_0 = θ^N_0. Additional notation is provided in Appendix A.5.1.
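For concreteness, the architecture and training procedure above can be sketched in a few lines of numpy. This is an illustrative toy sketch only: the sizes and step size used in the test below are small placeholder values, far from the scalings Theorem 1 requires, and the 'formal gradient' is replaced by ordinary backpropagation.

```python
import numpy as np

def init_params(n0, n, L, rng):
    # W^l ~ N(0, 2/n) for l in [L], and a final row W^{L+1} ~ N(0, 1),
    # matching the initialization described in the text.
    Ws = [rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n0))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)) for _ in range(L - 1)]
    Ws.append(rng.normal(0.0, 1.0, size=(1, n)))
    return Ws

def forward(Ws, x):
    # alpha^0 = x, alpha^l = [W^l alpha^{l-1}]_+, f_theta(x) = W^{L+1} alpha^L.
    alphas = [x]
    for W in Ws[:-1]:
        alphas.append(np.maximum(W @ alphas[-1], 0.0))
    return float(Ws[-1] @ alphas[-1]), alphas

def grad_step(Ws, X, y, tau):
    # One step of vanilla gradient descent on the empirical square loss
    # (1/2N) sum_i (f_theta(x_i) - y_i)^2, computed by ordinary backpropagation.
    N = X.shape[1]
    grads = [np.zeros_like(W) for W in Ws]
    for i in range(N):
        f, alphas = forward(Ws, X[:, i])
        zeta = f - y[i]                       # prediction error on sample i
        grads[-1] += (zeta / N) * alphas[-1][None, :]
        g = (zeta / N) * Ws[-1].T.ravel()     # gradient w.r.t. alpha^L
        for l in range(len(Ws) - 2, -1, -1):
            g = g * (alphas[l + 1] > 0)       # ReLU mask
            grads[l] += np.outer(g, alphas[l])
            g = Ws[l].T @ g
    return [W - tau * G for W, G in zip(Ws, grads)]
```

With a step size of order 1/(nL), in the range Theorem 1 prescribes, repeated calls to grad_step drive the empirical loss down on toy instances.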

2.2. ERROR DYNAMICS AND CERTIFICATES

Because it is difficult to endow the network parameters generated by the gradient iteration with a particular interpretation, we prefer to reason about how the network error ζ^N_k evolves under gradient descent. We calculate (in Lemma B.8)

ζ^N_{k+1}(x) = ζ^N_k(x) - τ ∫_M Θ^N_k(x, x′) ζ^N_k(x′) dµ_N(x′),   (2.1)

where we have defined the integral kernel

Θ^N_k(x, x′) = ∫_0^1 ⟨∇f_{θ^N_k}(x′), ∇f_{θ^N_k - tτ∇L_{µ_N}(θ^N_k)}(x)⟩ dt,

and ∇f_{θ_0} denotes a formal gradient of the initial network function with respect to the parameters, which is defined in detail in Appendix A.1. We then define a nominal error evolution by

ζ^∞_{k+1}(x) = ζ^∞_k(x) - τ ∫_M Θ(x, x′) ζ^∞_k(x′) dµ_∞(x′)   (2.2)

with identical initial conditions ζ^∞_0 = ζ, and where Θ(x, x′) = ⟨∇f_{θ_0}(x), ∇f_{θ_0}(x′)⟩ is the so-called neural tangent kernel, with associated integral operator Θ. We prove that the error evolution (2.1) is well-approximated by the nominal error evolution under suitable conditions on the network width, step size, and number of samples, which together ensure that training proceeds in the "NTK regime" where Θ^N_k stays close to Θ. As for the nominal error evolution (2.2), we note that this system is linear, time-invariant, and stable when τ is set appropriately small, so the norm of the nominal error is guaranteed to decrease rapidly if the initial error ζ aligns well with eigenfunctions of Θ corresponding to large eigenvalues. However, computation of these eigenfunctions is intractable for general data geometries and distributions because the operator Θ is not in general translationally invariant on M. To overcome this issue, we prove this alignment implicitly by constructing an approximate solution to the linear integral equation Θ[g] = ζ such that ‖g‖_{L²_{µ_∞}} is sufficiently small. To be precise, g ∈ L²_{µ_∞} will be called a (δ_1, δ_2)-certificate for the dynamics (2.2) if

‖Θ[g] - ζ‖_{L²_{µ_∞}} ≤ δ_1;   ‖g‖_{L²_{µ_∞}} ≤ δ_2.   (2.3)

2.3. MAIN RESULTS AND PROOF OUTLINE

Our main result is that conditional on the existence of a certificate of suitably small norm for M, gradient descent provably separates the two manifolds in time polynomial in the network depth.

Theorem 1. Let M be a one-dimensional Riemannian manifold satisfying our regularity assumptions. For any 0 < δ ≤ 1/e, choose

L ≥ C_1 max{ C_{µ_∞} log^9(1/δ) log^{24}(C_{µ_∞} n_0 log(1/δ)), κ² K_λ² / c_λ² },
n = C_2 L^{99} log^9(1/δ) log^{18}(L n_0),
N ≥ L^{10},

and fix τ such that C_3/(nL²) ≤ τ ≤ C_4/(nL). Then if there exists a certificate in the sense of (2.3) with δ_1 = C_5 C_ρ^{1/2} log(1/δ) log(n n_0)/L and δ_2 = C_6 log(1/δ) log(n n_0)/(n ρ_min^{1/2}), with probability at least 1 - δ over the random initialization of the network and the i.i.d. sample from µ_∞, the parameters obtained at iteration L^{39/44}/(nτ) of gradient descent on the finite sample loss L_{µ_N} yield a classifier that separates the two manifolds. The constants C_1, …, C_6 are suitably chosen absolute constants; the constants κ, K_λ, c_λ are respectively the extrinsic curvature constant and the global regularity constants defined in Section 2.1; the constant C_ρ is defined as max{ρ_min, ρ_min^{-1}}; and the constant C_{µ_∞} is defined as C_ρ^{15} (1 + ρ_max)^6 (min{µ_∞(M_+), µ_∞(M_-)})^{-11/2}.

For one-dimensional instances of the two manifold problem with sufficiently deep and overparameterized networks trained in the small-step-size regime, Theorem 1 completely reduces the analysis of the gradient iteration to the certificate problem. From a qualitative perspective, the network resource constraints imposed by Theorem 1 are natural: (i) The network depth L is set by geometric and statistical properties of the data, with only a mild polylogarithmic dependence on the ambient dimension n_0, which reflects the role of depth in controlling the capability of the network to fit functions.
(ii) The network width n is set by the depth L: the inductive structure of the network causes quantities that depend on the initial random weights θ_0 to concentrate worse as the depth is increased, which can be counteracted by setting the width appropriately large. (iii) The sample complexity of N ≥ L^{10} reflects the capacity of the network via the depth, and is in particular independent of the width n, which can thus be interpreted as purely a statistical resource. In addition, the conclusion of Theorem 1 implies not just that the expected generalization error with respect to µ_∞ of a binary classifier is zero, but the stronger separation property, i.e. that the generalization error will be zero for any choice of test distribution supported on M simultaneously. We give a brief sketch of the proof of Theorem 1 in Appendix A.4. To obtain a generalization guarantee from Theorem 1, it only remains to construct a certificate for M. We demonstrate this for the family of simple, highly-symmetric geometries shown in Figure 3, and leave the case of general one-dimensional manifolds for future work.

Proposition 1. Let M be an r-instance of the two circles geometry studied in Appendix C.1.1 and shown in Figure 3, with r ≥ 1/2. For any 0 < δ ≤ 1/e, if L ≥ C_1 ∆^{-1} and n ≥ C_2 L^5 log^4(1/δ) log^4(L n_0 log(1/δ)), then there exists a certificate in the sense of (2.3) satisfying the requirements of Theorem 1 with probability at least 1 - 3δ, for some absolute constants C_1, C_2 > 0.

Taking a union bound, Proposition 1 shows that under the hypotheses of Theorem 1, with probability at least 1 - 4δ a certificate exists for the geometry shown in Figure 3 as soon as L is larger than a constant multiple of the inverse separation ∆^{-1}, even as the separation approaches zero. We conjecture that a similar phenomenon holds for more general geometries, possibly with additional dependencies on the curvature and global regularity parameters of M.
The dependence of L on the geometry is due to the "sharpening" effect the depth has on the kernel Θ governing the dynamics, and thus on the fitting capabilities of the network, as illustrated in Figure 2a.

Figure 2: (a) Depth acts as a fitting resource. As L increases, the rotationally-invariant kernel Θ̂ (a slight modification of the deterministic kernel in Theorem 2) decays more rapidly as a function of the angle ∠(x, x′) between the inputs (n is held constant). Below the curves we show an isometric chart around a point x ∈ M_+. Once the decay scale of Θ̂ is small compared to the inter-manifold distance ∆ and the curvature of M_-, the network output can be changed at x while only weakly affecting its value on M_-. This is one mechanism that relates the depth required to solve the classification problem to the data geometry. (b) Width acts as a statistical resource. The dynamics at initialization are governed by Θ, a random process over the network parameters. As n is increased, the normalized fluctuations of Θ around its deterministic limit decrease (here L = 10). These two phenomena are related, since the fluctuations also grow with depth, as evinced by the scaling in Theorem 2.

To prove that the nominal error evolution (2.2) decreases rapidly and approximates the actual error evolution (2.1) throughout training, it is essential to have a precise characterization of the 'initial' neural tangent kernel Θ. One of our main technical contributions is to show concentration of Θ in the regime where the width n scales linearly with the depth L.

Theorem 2. For any d_0 ∈ N, let M be a d_0-dimensional complete Riemannian submanifold of S^{n_0-1}.
Then if n ≥ C_1 L d_0^4 log^4(C_M n_0 L), one has with probability at least 1 - n^{-10} that for every (x, x′) ∈ M × M,

| Θ(x, x′) - (n/2) Σ_{ℓ=0}^{L-1} cos(ϕ^{(ℓ)}(ν)) ∏_{ℓ′=ℓ}^{L-1} (1 - ϕ^{(ℓ′)}(ν)/π) | ≤ C_2 √(n L³ d_0^4 log^4(C_M n n_0)),

where we write ν = ∠(x, x′) with an abuse of notation, ϕ^{(ℓ)} denotes the ℓ-fold composition of ϕ(ν) = cos^{-1}((1 - ν/π) cos ν + (sin ν)/π), the constants C_1, C_2 > 0 are absolute, and the constant C_M > 0 depends only on the diameters and curvatures of the class manifolds (Lemma C.4).

For networks of uniform width that are wider than they are deep by a certain constant factor, we believe that the scalings in Theorem 2 are essentially optimal: the variance calculations of Hanin & Nica (2020) give some heuristic evidence here, and we believe the idea of using diagonal concentration to prove deviation lower bounds could be generalized to rigorously establish optimality. Figure 2b illustrates the phenomenon underlying Theorem 2. We discuss the proof of Theorem 2 in more detail in Sections 3.1 and 3.3.
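Both the angle contraction map ϕ and the deterministic kernel in Theorem 2 are straightforward to evaluate numerically. The sketch below (function names and the normalization convention n = 1 are ours) illustrates the iterated angles ϕ^{(ℓ)}(ν) shrinking roughly like 1/ℓ, and the limiting kernel decaying monotonically in the input angle:

```python
import numpy as np

def phi(nu):
    # Angle evolution map from Theorem 2:
    #   phi(nu) = arccos((1 - nu/pi) cos(nu) + sin(nu)/pi).
    c = (1.0 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

def iterated_angles(nu0, L):
    # [nu0, phi(nu0), phi^{(2)}(nu0), ..., phi^{(L)}(nu0)]
    out = [nu0]
    for _ in range(L):
        out.append(phi(out[-1]))
    return np.array(out)

def ntk_limit(nu, L, n=1.0):
    # Deterministic limit around which Theta concentrates (Theorem 2):
    # (n/2) sum_{l=0}^{L-1} cos(phi^{(l)}(nu)) prod_{l'=l}^{L-1} (1 - phi^{(l')}(nu)/pi)
    angs = iterated_angles(nu, L)
    total = 0.0
    for l in range(L):
        total += np.cos(angs[l]) * np.prod(1.0 - angs[l:L] / np.pi)
    return 0.5 * n * total
```

At ν = 0 the angles stay clamped at zero and the limit evaluates exactly to nL/2, while for ν > 0 the product factors damp the early-layer terms, producing the localization shown in Figure 2a.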

3.1. CONCENTRATION AT INITIALIZATION: MARTINGALES AND ANGLE CONTRACTION

The initial kernel Θ is a complicated random process defined over the weights (W^1, …, W^{L+1}). To control it, we first show for fixed (x, x′) that Θ(x, x′) concentrates with high probability, and then leverage approximate continuity properties to pass to uniform control of Θ. Here we describe our approach to pointwise control; uniformization is discussed in Section 3.3. The kernel can be written in the form

Θ(x, x′) = ⟨α^L(x), α^L(x′)⟩ + Σ_{ℓ=0}^{L-1} ⟨α^ℓ(x), α^ℓ(x′)⟩ ⟨β^ℓ(x), β^ℓ(x′)⟩,

where β^ℓ(x) = (W^{L+1} P_{I^L(x)} ⋯ W^{ℓ+2} P_{I^{ℓ+1}(x)})^* will be referred to as backward features, and P_{I^ℓ(x)} is the projection onto the coordinates {α^ℓ(x) > 0}. We consider ⟨β^0(x), β^0(x′)⟩ as a representative example: up to a small residual term, this random variable can be expressed as a sum of martingale differences. Formally, for ℓ ∈ [L], let F_ℓ denote the σ-algebra generated by all weight matrices up to layer ℓ, with F_0 denoting the trivial σ-algebra. We can then write

| ⟨β^0(x), β^0(x′)⟩ - g_0(ν_0) | ≤ | Σ_{ℓ=1}^{L+1} ( g_ℓ(W^ℓ, …, W^1, ν_0) - E[ g_ℓ(W^ℓ, …, W^1, ν_0) | F_{ℓ-1} ] ) | + R   (3.1)

for some functions g_ℓ and a controllable residual R, where ν_0 = ∠(x, x′). If we fix all the variables in F_{ℓ-1}, the fluctuations in the ℓ-th summand will be due to W^ℓ alone. Intuitively, since each weight matrix appears at most once in β^0(x), it will appear at most twice in g_ℓ, and therefore g_ℓ will have a subexponential distribution conditioned on F_{ℓ-1} and concentrate well around its conditional expectation. This property stems from the compositional structure of the network, with independent sources of randomness introduced at every layer, and is essentially agnostic to other details of the architecture. The concentration of the summands in (3.1) implies concentration of the sum: even though the summands are not independent, they can be controlled using concentration inequalities analogous to those for sums of independent variables (Azuma, 1967; Freedman, 1975).
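This layerwise concentration can be observed directly in simulation. The sketch below (toy sizes of our choosing) tracks the empirical feature angles ν^ℓ = ∠(α^ℓ(x), α^ℓ(x′)) of a random ReLU network against the deterministic prediction ϕ^{(ℓ)}(ν_0) from Theorem 2; the deviation stays small at this width, and shrinks further as the width grows:

```python
import numpy as np

def phi(nu):
    # Angle evolution map from Theorem 2.
    c = (1.0 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

def angle(u, v):
    return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)),
                             -1.0, 1.0))

rng = np.random.default_rng(0)
n0, n, L = 3, 2000, 10           # toy sizes chosen for speed, not Theorem 1 scalings
nu0 = np.pi / 2
x = np.array([1.0, 0.0, 0.0])
xp = np.array([np.cos(nu0), np.sin(nu0), 0.0])

a, ap = x, xp
angles, preds = [nu0], [nu0]
for l in range(L):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n0 if l == 0 else n))
    a, ap = np.maximum(W @ a, 0.0), np.maximum(W @ ap, 0.0)
    angles.append(angle(a, ap))     # empirical feature angle nu^l
    preds.append(phi(preds[-1]))    # deterministic prediction phi^{(l)}(nu0)
```

Note the clamping discussed below: if x = x′ the empirical angle is exactly zero at every layer, with no fluctuation at all.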
Showing that terms of the form ⟨α^ℓ(x), α^ℓ(x′)⟩ concentrate in the linear-width regime gives rise to additional challenges. Here we exploit an essential difference between the concentration properties of the angles between features ν^ℓ = ∠(α^ℓ(x), α^ℓ(x′)) and those of the correlation process ⟨α^ℓ(x), α^ℓ(x′)⟩ studied in prior works on concentration of Θ: when ν^{ℓ-1} = 0, we have ν^ℓ = 0 deterministically, whereas the correlation process behaves like a subexponential random variable with small but nonzero deviations. Together with smoothness, this clamping phenomenon allows us to show concentration of the angle at layer ℓ around the function ϕ^{(ℓ)}(ν_0), which is no larger than a constant multiple of ℓ^{-1}. This contraction of the angles with depth is the key to establishing Theorem 2; in addition, it gives the invariant kernel Θ̂ (see Figure 2a) its sharpness at zero and localization properties, both of which increase as the depth is increased and which we exploit in the proof of Proposition 1. We provide full details of our approach in Appendices D and E. By a simple argument that relies on positivity of Θ, we show that if we can solve the certificate problem (2.3), then for a suitably chosen learning rate τ and number of iterations k (Lemma B.6),

P[ ‖ζ^∞_k‖_{L²_{µ_∞}} ≤ C_ρ √d log(L)/L ] ≥ 1 - e^{-cd}.

If the network is sufficiently deep, the norm of the nominal error can thus be made arbitrarily small in a number of iterations that scales only polynomially with the problem parameters.

3.2. CERTIFICATE CONSTRUCTION: GENERAL FORMULATION AND A SIMPLE EXAMPLE

Because our formulation of the certificate problem (2.3) accommodates approximate solutions, under a minor condition on the network width n (see Proposition 1) it suffices to solve an auxiliary system Θ̂[g] = ζ̂, where Θ̂ and ζ̂ are analytically-convenient approximations to Θ and ζ produced by our concentration analysis, including Theorem 2. For the simple geometry in Fig. 3, we show in Appendix C.1.1 how to solve this auxiliary system using Fourier analysis, where we require L ≳ ∆^{-1}. The depth of the network is thus determined by the geometry of the data, and specifically by the inter-manifold distance, which intuitively sets the "difficulty" of the fitting problem. In Section 4 we discuss approaches to constructing certificates for general smooth curves.
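The Fourier-analytic construction for the two circles geometry can be sketched concretely: with a rotationally invariant kernel, the operator acting on functions on the two circles is block-circulant (one intra-circle block and one cross-circle block), so the auxiliary system diagonalizes frequency-by-frequency under the DFT into independent 2×2 linear systems. The kernel profile and coupling strength below are surrogates of our choosing, not the actual NTK limit:

```python
import numpy as np

# Sketch of the Fourier-analytic certificate construction for a two-circles
# geometry. The block operator
#   [K_in  K_x ] [g+]   [zeta+]
#   [K_x   K_in] [g-] = [zeta-]
# (intra-circle block K_in, cross-circle block K_x) is block-circulant for a
# rotationally invariant kernel, so each DFT frequency gives a 2x2 solve.
# The profile exp(-6*ang) and the coupling 0.05 are illustrative surrogates.
M = 256
t = np.arange(M) * 2 * np.pi / M
ang = np.abs(np.angle(np.exp(1j * t)))        # angle to the base point, in [0, pi]
prof = np.exp(-6.0 * ang)                     # sharply decaying invariant profile
K_in = prof                                   # intra-circle kernel (first row)
K_x = 0.05 * prof                             # weaker cross-circle coupling

# Target error: roughly -f* on each circle, plus an illustrative non-constant part.
z_plus = -np.ones(M) + 0.2 * np.cos(t)
z_minus = np.ones(M) + 0.2 * np.cos(t)

# Circulant blocks act by circular convolution with quadrature weight 2*pi/M.
Fk_in = np.fft.fft(K_in) * (2 * np.pi / M)
Fk_x = np.fft.fft(K_x) * (2 * np.pi / M)
Fz_p, Fz_m = np.fft.fft(z_plus), np.fft.fft(z_minus)
det = Fk_in ** 2 - Fk_x ** 2
g_p = np.fft.ifft((Fk_in * Fz_p - Fk_x * Fz_m) / det).real
g_m = np.fft.ifft((Fk_in * Fz_m - Fk_x * Fz_p) / det).real

# L2 norm of the certificate (the delta_2-type quantity in (2.3)).
cert_norm = np.sqrt((2 * np.pi / M) * (np.sum(g_p ** 2) + np.sum(g_m ** 2)))
```

The sharper the kernel decay relative to the coupling, the better conditioned the per-frequency systems are; this mirrors how the requirement L ≳ ∆^{-1} controls the cross-manifold terms in the actual construction.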

3.3. UNIFORM CONCENTRATION AND ITS CONSEQUENCES

To uniformize the pointwise estimates of Section 3.1, we must overcome the issue that the backward features β^ℓ(x) are not continuous functions of the input, due to the matrices P_{I^ℓ(x)}. Our approach is to discretize the input space, control the number of features that can change sign near each point in the discretization, then extend the pointwise estimates of Section 3.1 to the setting where a small number of features have changed sign; again, we find martingale concentration a necessity to achieve linear width-depth scaling. We give full details in Appendix D.3. Although Theorem 2 is the main application of these estimates (with uniform control of Θ, we can prove operator norm bounds on its corresponding integral operator Θ, which is of great help in proving generalization results), they also imply useful regularity estimates for the initial random network f_{θ_0}. For example, we prove that networks of uniform width n ≳ n_0^4 L are with high probability n_0 (log n_0)(log L)-Lipschitz as functions on R^{n_0} (Theorem B.5); in particular, the Lipschitz constant depends only logarithmically on the depth, in contrast to existing results in the literature (Nguyen et al., 2020). For networks of larger width n ≳ d_0^3 L^5, we prove that with high probability the network f_{θ_0} is approximately constant on the domain M ⊂ S^{n_0-1} (Lemma D.11):

sup_{x∈M} | f_{θ_0}(x) - ∫_M f_{θ_0}(x′) dµ_∞(x′) | ≲ 1/L.

We find this result to be useful in simplifying the certificate construction problem of Section 3.2.

4. DISCUSSION

Certificates for curves. The most urgent task toward expanding the scope of Theorem 1 is the construction of certificates for geometries beyond the coaxial circles of Proposition 1. The proof of Proposition 1 relies heavily on translation invariance of the intra- and inter-manifold distances in the coaxial circles geometry in order to avoid the need for certain sharp technical estimates on the decay of the kernel Θ. With sharper control of this decay, it is possible to select the network depth in a way that grants appropriate worst-case control of the magnitude of the cross-manifold integrals in the action of Θ (as in Figure 2a), allowing us to reduce to what is essentially a one-manifold certificate construction problem that can be solved with harmonic analysis. Beyond these considerations, it is important to extend Theorem 1 to manifolds of dimension d_0 > 1, which should be relatively straightforward: our concentration results, notably including Theorem 2, are already applicable to manifolds of arbitrary dimension.

Convolutional networks and non-differentiable manifolds. Although we have motivated our data model in the multiple manifold problem using applications in computer vision, it is important to note that the spatially-structured image articulation manifolds that arise as data in these contexts do not carry a differentiable structure (Wakin et al., 2005), so the assumption of bounded curvature may not be realistic here. On the other hand, in these applications it is standard to employ a convolutional network architecture. We anticipate that our martingale concentration framework can be extended to these architectures, and beyond establishing analogues of Theorem 1 in this setting, we believe it should be possible to obtain similar guarantees for models of image articulation manifolds.
In particular, one might expect randomly-initialized convolutional networks to enjoy local invariance properties, like the scattering networks of Mallat (Mallat, 2012; Bruna & Mallat, 2013), which could achieve a degree of invariant classification without expending additional network resources computing convolutions over general LCA groups (Cohen & Welling, 2016). The importance of being low-dimensional. Ghorbani et al. (2019) show that kernel ridge regression with any rotationally invariant kernel on $S^d$ (including that of a deep network) is equivalent to polynomial regression with a degree-$p$ polynomial when the number of samples is bounded by $d^{p+1}$ and $d \to \infty$. For data lying on a low-dimensional manifold, as we consider here, one would expect less pessimistic rates; indeed, in a subsequent work (Ghorbani et al., 2020) the authors establish guarantees similar to those of Ghorbani et al. (2019) for a linear data model with low-dimensional structure, in terms of a smaller "effective dimension". In comparison, although our present certificate construction argument only yields dynamics for the restrictive coaxial circles geometry of Figure 3, for which one can obtain guarantees for kernel regression with a shallow NTK by the results of Ghorbani et al. (2020), the general multiple manifold problem formulation allows one to model nonlinear structure in the data, and measures fitting difficulty through intrinsic parameters like the curvature and separation. The guarantees in Ghorbani et al. (2019; 2020) depend on the approximability of the target function by low-degree polynomials, and although this affords additional generality over our model, it seems more challenging to relate this notion to geometric or other types of nonlinear low-dimensional structure. The NTK regime and beyond.
In recent years there has been much work devoted to the analysis of networks trained in the regime where the changes in $\Theta^N_k$ remain small and the dynamics in (2.1) are close to linear (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Allen-Zhu & Li, 2019), referred to as the NTK, "overparameterized", or kernel regime. Concurrently, there have also been results highlighting the limitations of this regime. In (Chizat & Bach, 2018) the authors coin the term "lazy training" for dynamics in which the relative change in the differential of the network function is small compared to the change in the objective during gradient descent. While the dynamics we study indeed fall into this category, our analysis makes it evident that not all lazy training regimes are created equal. Our performance guarantees depend on the structure of the kernel $\Theta$, and on controlling the fluctuations of $\Theta^N_k$ around it. We are able to control these only if the width of the network is sufficiently large compared to the depth. In contrast, lazy training can also be achieved in homogeneous models by simply scaling the output of the model (Chizat & Bach, 2018), in which case one cannot argue that the kernel has the decay properties that enable it to fit data. Our analysis hinges on staying in the NTK regime during training. We obtain suboptimal scaling of $n$ with $L$ in Theorem 1 because we treat all changes that occur in $\Theta^N_k$ during training as adversarial to the algorithm's ability to generalize. It is likely that if an improved understanding of feature learning could be incorporated into an analysis of the dynamics, the resulting scaling requirements would be more realistic.

A EXTENDED PROBLEM FORMULATION

A.1 REGARDING THE ALGORITHM

We analyze a gradient-like method for the minimization of the empirical loss $L_{\mu_N}$. After randomly initializing the parameters $\theta^N_0$ as $W^\ell \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 2/n)$ for $\ell \in [L]$ and $W^{L+1} \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 1)$, independently of the samples $x_1, \dots, x_N$, we consider the sequence of iterates $\theta^N_{k+1} = \theta^N_k - \tau \nabla L_{\mu_N}(\theta^N_k)$, (A.1) where $\tau > 0$ is a step size, and $\nabla L_{\mu_N}$ represents a 'formal gradient' of the loss $L_{\mu_N}$, which we define as follows: first, we define formal gradients of the network output by $\nabla_{W^\ell} f_\theta(x) = \beta^{\ell-1}_\theta(x)\, \alpha^{\ell-1}_\theta(x)^*$ for $\ell \in [L]$ and $x \in M$, where we have introduced the definitions $\beta^\ell_\theta(x) = \bigl( W^{L+1} P_{I_L(x)} W^L P_{I_{L-1}(x)} \cdots W^{\ell+2} P_{I_{\ell+1}(x)} \bigr)^*$ for $\ell = 0, 1, \dots, L-1$, and where we additionally define $I_\ell(x) = \operatorname{supp}\bigl(\mathbf{1}_{\alpha^\ell_\theta(x) > 0}\bigr)$ and $P_{I_\ell(x)} = \sum_{i \in I_\ell(x)} e_i e_i^*$ for the orthogonal projection onto the set of coordinates where the $\ell$-th activation at input $x$ is positive. We call the vectors $\beta^\ell_\theta(x)$ the backward features or backward activations; they correspond to the backward pass of our neural network. We also define $\nabla_{W^{L+1}} f_\theta(x) = \alpha^L_\theta(x)^*$. We then define the formal gradient of the loss $L_{\mu_N}$ by $\nabla L_{\mu_N}(\theta) = \int_M \nabla f_\theta(x)\, \zeta_\theta(x)\, d\mu_N(x)$. Let us emphasize again that the expressions above are definitions, not gradients in the analytical sense: we introduce these definitions to cope with nonsmoothness of the ReLU $[\,\cdot\,]_+$. On the other hand, our formal gradient definitions coincide with the expressions one obtains by applying the chain rule to differentiate $L_{\mu_N}$ at points where the ReLU is differentiable, and we will make use of this fact to proceed with these formal gradients in a manner almost identical to the differentiable setting.
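These definitions are easy to instantiate concretely. The following sketch (hypothetical small sizes $n_0$, $n$, $L$; NumPy only) implements the fan-out initialization, the features $\alpha^\ell$, and the formal gradient $\nabla_{W^\ell} f_\theta = \beta^{\ell-1} (\alpha^{\ell-1})^*$ via the masked backward pass, and checks one entry against a central finite difference at a generic input, where the network is differentiable almost surely:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n, L = 6, 40, 3  # hypothetical small sizes for illustration

# Fan-out initialization: W^1, ..., W^L ~iid N(0, 2/n); W^{L+1} ~iid N(0, 1).
W = [rng.normal(0.0, np.sqrt(2.0 / n), (n, n0))]
W += [rng.normal(0.0, np.sqrt(2.0 / n), (n, n)) for _ in range(L - 1)]
W.append(rng.normal(0.0, 1.0, (1, n)))  # W^{L+1}, stored as W[L]

def features(x):
    # alpha^0 = x, alpha^l = [W^l alpha^{l-1}]_+ for l = 1, ..., L
    alphas = [x]
    for l in range(L):
        alphas.append(np.maximum(W[l] @ alphas[-1], 0.0))
    return alphas

def f(x):
    return (W[L] @ features(x)[L]).item()

def formal_grad(x, l):
    # grad_{W^l} f = beta^{l-1} (alpha^{l-1})^*, where the row vector
    # b = W^{L+1} P_{I_L} W^L P_{I_{L-1}} ... W^{l+1} P_{I_l} equals (beta^{l-1})^*.
    alphas = features(x)
    b = W[L] * (alphas[L] > 0)            # W^{L+1} P_{I_L}
    for m in range(L, l, -1):             # append W^m P_{I_{m-1}} for m = L, ..., l+1
        b = (b @ W[m - 1]) * (alphas[m - 1] > 0)
    return b.T @ alphas[l - 1][None, :]   # same shape as W^l

# Check one entry against a central finite difference at a generic input.
x = rng.normal(size=n0)
x /= np.linalg.norm(x)
l, i, j, eps = 2, 3, 5, 1e-6
W[l - 1][i, j] += eps; fp = f(x)
W[l - 1][i, j] -= 2 * eps; fm = f(x)
W[l - 1][i, j] += eps                     # restore the perturbed weight
fd = (fp - fm) / (2 * eps)
assert abs(fd - formal_grad(x, l)[i, j]) < 1e-5 * max(1.0, abs(fd))
```

The agreement with the finite difference reflects the remark above: away from activation boundaries, the formal gradient is the chain-rule gradient.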
We reiterate here our notational conventions for quantities evaluated at these iterates: we denote evaluation of quantities such as the features and prediction error at parameters along the gradient descent trajectory using a subscript $k$, with an omitted subscript denoting evaluation at the initial $k = 0$ parameters, and we add a superscript $N$ to parameters such as the prediction error to emphasize that they are evaluated at the parameters generated by (A.1). For example, in this notation we express $\zeta_{\theta^N_k}$ as $\zeta^N_k$. In addition, we use $\theta_0$ to denote the initial parameters $\theta^N_0$. We emphasize the dependence of certain quantities on these random initial parameters notationally, including the initial network function $f_{\theta_0}$.

A.2 REGARDING THE DATA MANIFOLDS

We now provide additional details regarding our assumptions on the data manifolds. For background on curves and, more broadly, Riemannian manifolds, we refer the reader to (Lee, 2018; Absil et al., 2009). We assume that $M = M_+ \cup M_-$, where $M_+$ and $M_-$ are two disjoint, complete, connected Riemannian submanifolds of the unit sphere $S^{n_0-1}$, with $n_0 \ge 3$. In particular, $M_\pm$ are compact. We take as metric on these manifolds the metric induced by that of the sphere, which we take in turn as that induced by the euclidean metric on $\mathbb{R}^{n_0}$. We write $\mu^\infty_+$ and $\mu^\infty_-$ for the measures on $M_+$ and $M_-$ (respectively) induced by the data measure $\mu^\infty$, and we assume that $\mu^\infty$ admits a density $\rho$ with respect to the Riemannian measure on $M$, writing $\rho_+$ and $\rho_-$ for the densities on $M_\pm$ induced by $\rho$. When $d_0 = 1$, we add additional structural assumptions to the above: we assume that $M_\pm$ are smooth, simple, regular curves. Concretely, that $\mu^\infty$ admits a density $\rho$ with respect to the Riemannian measure means that $1 = \int_M d\mu^\infty(x) = \int_{M_+} \rho_+(x)\,dV_+(x) + \int_{M_-} \rho_-(x)\,dV_-(x)$. When $d_0 = 1$, because $M_\pm$ are smooth regular curves, they admit global unit-speed parameterizations with respect to arc length $\gamma_\pm : I_\pm \to S^{n_0-1}$, where $I_\pm$ are intervals of the form $[0, \operatorname{len}(M_\pm)]$. In this setting, the curvature constraint is expressed as $\max\bigl\{ \sup_{s \in I_+} \|\ddot\gamma_+(s)\|_2, \sup_{s \in I_-} \|\ddot\gamma_-(s)\|_2 \bigr\} \le \kappa$, and we observe that the fact that $M_\pm$ are sphere curves implies $\kappa \ge 1$. Exploiting the coordinate representation of the Riemannian measure and the fixed inherited metric from $\mathbb{R}^{n_0}$, we thus have $\int_{M_\pm} \rho_\pm(x)\,dV_\pm(x) = \int_{I_\pm} \rho_\pm(\gamma_\pm(t))\, \|\dot\gamma_\pm(t)\|_2\,dt = \int_{I_\pm} \rho_\pm(\gamma_\pm(t))\,dt$. We will exploit this formula in the sequel to compare $L^p_\mu(M)$ and $L^p(M)$ norms of functions defined on the manifold. More generally, we will frequently make use of similar reasoning that leverages the existence of unit-speed parameterizations for the curves.
For clarity we restate the global regularity condition: we assume there exist constants $0 < c_\lambda \le 1$, $K_\lambda \ge 1$ such that for all $s \in (0, c_\lambda/\kappa]$, $(x, x') \in M_\sigma \times M_\sigma$, $\sigma \in \{+, -\}$: $\angle(x, x') \le s \implies \operatorname{dist}_{M_\sigma}(x, x') \le K_\lambda s$, (A.2) where $\operatorname{dist}_{M_\sigma}$ denotes the Riemannian distance between points in a common connected component, and we define $C_\lambda = K_\lambda^2 / c_\lambda^2$. Because $M_\pm$ are simple curves, they do not self-intersect; the assumption (A.2) gives a quantitative characterization of how far the curves are from self-intersecting. We illustrate how the associated constants can be obtained from the assumption that the manifolds are simple curves: for either $\sigma \in \{+, -\}$, consider a connected component $M_\sigma \subset M$, and for any $0 < s \le \operatorname{len}(M_\sigma)$, define $r_\sigma(s) = \inf \bigl\{ \angle(x, x') : (x, x') \in M_\sigma \times M_\sigma,\ \operatorname{dist}_{M_\sigma}(x, x') > s \bigr\}$. If $r_\sigma(s) = 0$, by compactness we could extract a convergent sequence of pairs of points with vanishing angle but Riemannian distance larger than $s$, and its limit would be a self-intersection of $M_\sigma$, contradicting our assumption that the curve is simple. It follows that $r_\sigma(s) > 0$ for every value of $s$. By the contrapositive of the definition of $r_\sigma$, for any $(x, x') \in M_\sigma \times M_\sigma$ we have $\angle(x, x') \le r_\sigma(s) \implies \operatorname{dist}_{M_\sigma}(x, x') \le s$; writing $K_s = s / r_\sigma(s)$, this reads $\angle(x, x') \le r_\sigma(s) \implies \operatorname{dist}_{M_\sigma}(x, x') \le K_s\, r_\sigma(s)$. Our regularity assumption asserts that a single such constant holds for a range of scales below the curvature scale, which is a mild assumption since $K_s$ approaches 1 as $s$ approaches 0.

A.3 REGARDING THE INITIALIZATION

The manner in which we have defined our initial random neural network $f_{\theta_0}$ is sometimes referred to as "fan-out initialization" in the literature: it guarantees that feature norms are preserved from layer to layer in the network, and thereby avoids the vanishing and exploding gradient problems. The difference between this initialization and the so-called "standard" or "fan-in" initialization lies only in the first and last layer weights, yet in a sufficiently deep network trained in the NTK regime the effect of any single layer is negligible, and the dynamics of our network will be essentially identical to those of a network with standard initialization. On the other hand, following the work of Jacot et al. (2018), it has become common in the theoretical literature to consider a different construction of the neural network called "NTK parameterization", which is in some ways more convenient for theoretical analysis. In particular, Arora et al. (2019b) prove their results on NTK concentration using this parameterization; to facilitate a comparison between our concentration result (Theorem 2) and theirs, we discuss the connection between fan-out and NTK parameterization in this section. This material is well-known and can no doubt be found already in the literature, but we believe it may be helpful to translate it into our notation. Recall our definitions for the weights and features in our neural network: we have $W^\ell \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 2/n)$ for $\ell \in [L]$ and $W^{L+1} \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 1)$, with features defined for $\ell = 0, 1, \dots, L$ by $\alpha^\ell(x) = x$ if $\ell = 0$ and $\alpha^\ell(x) = [W^\ell \alpha^{\ell-1}(x)]_+$ otherwise, and output $f_{\theta_0}(x) = W^{L+1} \alpha^L(x)$. Within this section, and only within this section, we define auxiliary weights $G^{(1)} \in \mathbb{R}^{n \times n_0}$, $G^{(\ell)} \in \mathbb{R}^{n \times n}$ for integer $1 < \ell < L+1$, and $G^{(L+1)} \in \mathbb{R}^{1 \times n}$, with distributions $G^{(\ell)} \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 1)$ for $\ell \in [L+1]$, independent of everything else in the problem. As before, for $\ell \in \{0, 1, \dots, L\}$ we use $\alpha^{(\ell)}_{\mathrm{NTK}}(x)$ to denote the layer features: $\alpha^{(\ell)}_{\mathrm{NTK}}(x) = x$ if $\ell = 0$ and $\alpha^{(\ell)}_{\mathrm{NTK}}(x) = [G^{(\ell)} \alpha^{(\ell-1)}_{\mathrm{NTK}}(x)]_+$ otherwise. This network's output is written $f_{\mathrm{NTK}}(x) = \prod_{\ell=1}^{L} \sqrt{2/n}\; G^{(L+1)} \alpha^{(L)}_{\mathrm{NTK}}(x) = (2/n)^{L/2}\, G^{(L+1)} \alpha^{(L)}_{\mathrm{NTK}}(x)$. By positive 1-homogeneity of the ReLU, it follows that $f_{\theta_0} \overset{d}{=} f_{\mathrm{NTK}}$. As the notation suggests, the network $f_{\mathrm{NTK}}$ corresponds to an "NTK parameterization" network; although this network and $f_{\theta_0}$ are equivalent in terms of predictions, their "gradients" are not equivalent. The NTK for the NTK parameterization network is obtained by differentiating (at points of differentiability): after calculating (as in Lemma B.8), we introduce notation as we did for the fan-out parameterization network in Appendix A.1, so that $\Theta_{\mathrm{NTK}}(x, x') = \langle \nabla f_{\mathrm{NTK}}(x), \nabla f_{\mathrm{NTK}}(x') \rangle$, with gradients $\nabla_{G^{(\ell)}} f_{\mathrm{NTK}}(x)$ defined for $\ell = 1, \dots, L+1$ in terms of the corresponding backward features $\beta^{(\ell)}_{\mathrm{NTK}}(x)$. We shall relate the NTK parameterization kernel $\Theta_{\mathrm{NTK}}$ to our fan-out parameterization kernel $\Theta$ using homogeneity of the ReLU. First, let us observe that $\{ i \in [n] \mid \alpha^\ell(x)_i > 0 \} \overset{d}{=} \{ i \in [n] \mid \alpha^{(\ell)}_{\mathrm{NTK}}(x)_i > 0 \}$, because $[\,\cdot\,]_+$ is 1-homogeneous and we have $G^{(\ell)} \overset{d}{=} \sqrt{n/2}\, W^\ell$ when $\ell \le L$. For $\ell \in \{0, 1, \dots, L\}$, we note that both $\alpha^\ell(x)$ and $\alpha^{(\ell)}_{\mathrm{NTK}}(x)$ depend only on the parameters $(W^1, \dots, W^\ell)$ and $(G^{(1)}, \dots, G^{(\ell)})$, respectively. If we write $\theta = (W^1, \dots, W^L)$ and $\theta_{\mathrm{NTK}} = (G^{(1)}, \dots, G^{(L)})$, then it follows that $\alpha^\ell(x)$ is an $\ell$-homogeneous function of $\theta$ (and likewise for $\alpha^{(\ell)}_{\mathrm{NTK}}(x)$). In addition, the projection matrices $P_{I_\ell(x)}$ are 0-homogeneous functions of $\theta$, and so taking $\ell \in \{0, 1, \dots, L-1\}$ and counting parameters in the definitions of $\beta^\ell(x)$ and $\beta^{(\ell)}_{\mathrm{NTK}}(x)$ implies that these two functions are $(L - \ell - 1)$-homogeneous functions of $\theta$ and $\theta_{\mathrm{NTK}}$, respectively. Of course, for $\ell = L$, they are 0-homogeneous. Thus, using again that $G^{(\ell)} \overset{d}{=} \sqrt{n/2}\, W^\ell$ for $\ell \le L$, we obtain $\nabla_{G^{(\ell)}} f_{\mathrm{NTK}}(x) \overset{d}{=} \sqrt{2/n}\, \nabla_{W^\ell} f_{\theta_0}(x)$ for $\ell = 1, \dots, L$, and $\nabla_{G^{(L+1)}} f_{\mathrm{NTK}}(x) \overset{d}{=} \nabla_{W^{L+1}} f_{\theta_0}(x)$.
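As a concrete check on the output equivalence, the equality in distribution $f_{\theta_0} \overset{d}{=} f_{\mathrm{NTK}}$ can be realized by an explicit coupling: setting $G^{(\ell)} = \sqrt{n/2}\, W^\ell$ for $\ell \le L$ and $G^{(L+1)} = W^{L+1}$, homogeneity of the ReLU makes the two outputs equal exactly, not merely in distribution. A minimal NumPy sketch (all sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n, L = 5, 32, 4  # hypothetical small sizes

# Fan-out weights: W^1..W^L ~ N(0, 2/n), W^{L+1} ~ N(0, 1).
W = [rng.normal(0, np.sqrt(2 / n), (n, n0))]
W += [rng.normal(0, np.sqrt(2 / n), (n, n)) for _ in range(L - 1)]
w_out = rng.normal(0, 1.0, (1, n))

def f_fanout(x):
    a = x
    for Wl in W:
        a = np.maximum(Wl @ a, 0.0)
    return (w_out @ a).item()

# Couple the NTK-parameterized weights to the fan-out ones: G^(l) = sqrt(n/2) W^l.
G = [np.sqrt(n / 2) * Wl for Wl in W]
G_out = w_out  # G^(L+1) = W^{L+1}

def f_ntk(x):
    a = x
    for Gl in G:
        a = np.maximum(Gl @ a, 0.0)
    # output scale prod_{l=1}^{L} sqrt(2/n) = (2/n)^{L/2}
    return ((2 / n) ** (L / 2) * (G_out @ a)).item()

x = rng.normal(size=n0)
x /= np.linalg.norm(x)
assert np.isclose(f_fanout(x), f_ntk(x))  # equal outputs under this coupling
```

The same scalar bookkeeping, applied to the gradients, is what produces the $\sqrt{2/n}$ factors relating $\nabla_{G^{(\ell)}} f_{\mathrm{NTK}}$ and $\nabla_{W^\ell} f_{\theta_0}$.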
Although we have argued equality in distribution above for each index separately for simplicity, the elementary nature of our arguments (we are just moving scalars around) and the structure of the statistical dependencies across gradients allow us to apply the same argument 'in parallel' to the sum of inner products between gradients, yielding $\Theta_{\mathrm{NTK}}(x, x') \overset{d}{=} \langle \alpha^L(x), \alpha^L(x') \rangle + \frac{2}{n} \sum_{\ell=1}^{L} \langle \alpha^{\ell-1}(x), \alpha^{\ell-1}(x') \rangle \langle \beta^{\ell-1}(x), \beta^{\ell-1}(x') \rangle$. This expression makes it immediately clear that our concentration framework proves sharp concentration of the NTK of a uniform-width NTK parameterization feedforward ReLU network, improving over the results of Arora et al. (2019b) when the data lie on the sphere; a simple adaptation of the proof of Theorem B.3 will suffice.

A.4 PROOF OUTLINE FOR THEOREM 1

In Appendix B, we prove a slightly more general version of Theorem 1 in Theorem B.1. Here, we give a brief outline of the proof of this result. Proving the separation property essentially requires us to obtain control of $\|\zeta^N_k\|_{L^\infty(M)}$, and by an interpolation inequality (Lemma B.14) it suffices to control the generalization error $\|\zeta^N_k\|_{L^2_{\mu^\infty}}$ and the smoothness (measured through the Lipschitz constant) of $\zeta^N_k$. We start with the generalization error, picking up from where we left off at the end of Section 2.2: the triangle inequality gives $\|\zeta^N_k\|_{L^2_{\mu^\infty}} \le \|\zeta^\infty_k\|_{L^2_{\mu^\infty}} + \|\zeta^\infty_k - \zeta^N_k\|_{L^2_{\mu^\infty}}$, (A.3) which allows us to divide the analysis into two subproblems: characterizing the nominal dynamics (Lemmas B.6 and B.12), and the nominal-to-finite transition (Lemma B.7). Beginning with the nominal dynamics, we use (2.2) to write $\zeta^\infty_k = (\operatorname{Id} - \tau \mathbf{\Theta})^k[\zeta]$, where $\mathbf{\Theta}$ denotes the operator on $L^2_{\mu^\infty}$ corresponding to integration against the kernel $\Theta$ and $\operatorname{Id}$ denotes the identity operator. The definition of $\Theta$ and compactness of $M$ imply that $\mathbf{\Theta}$ is a positive, compact operator (Lemma B.9), so these dynamics are stable when $\tau$ is chosen no larger than the inverse of the operator norm of $\mathbf{\Theta}$.
However, the rate of decrease of $\|\zeta^\infty_k\|_{L^2_{\mu^\infty}}$ with $k$ could still be extremely slow if the initial error $\zeta$ has significant components in the direction of eigenfunctions of $\mathbf{\Theta}$ corresponding to small eigenvalues, and because $\mathbf{\Theta}$ acts roughly like a convolution operator, we expect there to exist eigenvalues arbitrarily close to zero. By solving the certificate problem (2.3), we can assert that this misalignment does not occur. To solve the certificate problem, as we describe in Section 3.2, we work with analytically convenient approximations for $\Theta$ and $\zeta$: the exact definitions of these approximations are given in Appendix A.5.2, and we prove their suitability as approximations in Theorem B.2 (a slightly more general version of Theorem 2) and Lemma D.11, respectively. As we have discussed in Section 3.1, our rates of concentration for $\Theta$ about its deterministic approximation are essentially optimal; the poor rates that end up appearing in Theorem B.1 are set by later parts of the argument. With our approximation to $\Theta$ justified, we show that for any sufficiently small step size $\tau$ and number of iterations $k$, solving the certificate problem (2.3) guarantees appropriate decrease of the nominal generalization error; additional details are discussed in Section 3.2. The key property that we use in constructing certificates in Proposition B.4 (the 'appendix version' of Proposition 1) is that as the depth $L$ increases, the kernel $\Theta$ sharpens and localizes (Fig. 2a): the conditions on $L$ in Theorem B.1 guarantee that the sharpness is sufficient to ensure that the cross-manifold integrals in the certificate problem are small in magnitude, which leads to rapid decrease of the nominal error. Our precise characterization of this phenomenon is presented in Appendix C. To complete the proof, we justify the nominal-to-finite transition in (A.3).
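The misalignment phenomenon described above is easy to see numerically. In the sketch below, a PSD Gram matrix of a localized rotation-invariant kernel on the circle serves as a hypothetical stand-in for $\mathbf{\Theta}$ (it is not the NTK itself): the iteration $\zeta \mapsto (\mathrm{Id} - \tau \mathbf{\Theta})[\zeta]$ drives an error aligned with a top eigenvector to zero rapidly, while an error aligned with a near-null eigenvector barely decays at all.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
# Stand-in kernel operator: Gram matrix of a sharp, localized PSD kernel on S^1.
t = np.linspace(0, 2 * np.pi, N, endpoint=False)
K = np.exp(-8 * (1 - np.cos(t[:, None] - t[None, :])))
K = K / N  # quadrature weight, so K approximates an integral operator
evals, evecs = np.linalg.eigh(K)       # ascending eigenvalues, orthonormal columns
tau = 0.9 / evals.max()                # stable step size: tau < 1 / ||K||

def err_after(zeta0, k):
    z = zeta0.copy()
    for _ in range(k):
        z = z - tau * (K @ z)          # zeta <- (Id - tau K) zeta
    return np.linalg.norm(z)

top = evecs[:, -1]   # aligned with a large eigenvalue: rapid decay
bot = evecs[:, 0]    # aligned with a near-zero eigenvalue: essentially no decay
assert err_after(top, 50) < 1e-3
assert err_after(bot, 50) > 0.9
```

A certificate $g$ with $\mathbf{\Theta}[g] \approx \zeta$ and small norm rules out the second situation: it witnesses that $\zeta$ is supported on well-conditioned directions of $\mathbf{\Theta}$.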
Starting from the update equations (2.1) and (2.2), subtracting and rearranging gives an update equation for the difference: $\zeta^N_k - \zeta^\infty_k = (\operatorname{Id} - \tau \mathbf{\Theta})\bigl[ \zeta^N_{k-1} - \zeta^\infty_{k-1} \bigr] - \tau \bigl( \mathbf{\Theta}^N_{k-1}[\zeta^N_{k-1}] - \mathbf{\Theta}[\zeta^N_{k-1}] \bigr)$. In particular, if $\tau$ is chosen less than the inverse of the operator norm of $\mathbf{\Theta}$, we can take norms on both sides of the previous equation, apply the triangle inequality, then exploit a telescoping series cancellation to obtain the difference bound $\|\zeta^\infty_k - \zeta^N_k\|_{L^2_{\mu^\infty}} \le \tau \sum_{s=0}^{k-1} \bigl\| \int_M \Theta^N_s(\,\cdot\,, x')\, \zeta^N_s(x')\, d\mu_N(x') - \int_M \Theta(\,\cdot\,, x')\, \zeta^N_s(x')\, d\mu^\infty(x') \bigr\|_{L^2_{\mu^\infty}}$. (A.4) There are two obstacles to controlling the norm terms on the RHS of (A.4): the kernels $\Theta^N_s$ are distinct from the kernel $\Theta$ due to changes in the weights that occur during training, and the empirical measure $\mu_N$ incurs a sampling error relative to the population measure $\mu^\infty$. To address the first challenge, we measure the changes to the NTK during training in a worst-case fashion as $\Delta^N_k = \max_{i \in \{0, 1, \dots, k\}} \|\Theta^N_i - \Theta\|_{L^\infty(M \times M)}$, and train in the NTK regime, where the network width $n$ is larger than a large polynomial in the depth $L$ and the total training time $k\tau$ is no larger than $L/n$. These conditions imply that with high probability $\Delta^N_k$ is no larger than a constant multiple of $n^{1-\delta} \operatorname{poly}(L, d_0)$ for a small constant $\delta > 0$, so that the amortized changes during training $k\tau \Delta^N_k$ can be made small by sufficient overparameterization. We provide full details of this argument in Appendix F.
By the preceding argument, we can use the triangle inequality and Jensen's inequality to pass from the norm term in (A.4) to a difference-of-measures term which integrates against $\Theta$, and by Theorem B.2, we can replace the integration against $\Theta$ by an integration against a smooth, deterministic kernel, which leads to a bound $\|\zeta^\infty_k - \zeta^N_k\|_{L^2_{\mu^\infty}} \le R_k(n, L, d_0) + \tau \sum_{s=0}^{k-1} \bigl\| \int_M \psi_1(\angle(\,\cdot\,, x'))\, \zeta^N_s(x')\, \bigl( d\mu_N(x') - d\mu^\infty(x') \bigr) \bigr\|_{L^2_{\mu^\infty}}$, where $R_k$ is a residual term that we argue is small in the NTK regime with high probability, and for concision we write $\psi_1$ to denote the function of $\angle(x, x')$ that appears in Theorem B.2. To control the remaining term, we make use of a basic result from optimal transport theory, which states that for any probability measure $\mu$ on the Borel sets of a metric space $X$ and corresponding empirical measure $\mu_N$, one has for every Lipschitz function $f$: $\bigl| \int_X f(x)\, \bigl( d\mu(x) - d\mu_N(x) \bigr) \bigr| \le \|f\|_{\mathrm{Lip}}\, W(\mu, \mu_N)$, where $W(\,\cdot\,, \,\cdot\,)$ denotes the 1-Wasserstein metric, together with concentration inequalities for empirical measures in the 1-Wasserstein metric (Weed & Bach, 2019). To apply this result in our setting, it is necessary to control the change throughout training of the Lipschitz constant of $\zeta^N_k$, and one must also account for the fact that the metric space here is $M$, which has two distinct connected components. We treat the first issue using an inductive argument, and our treatment of the second issue (Lemma B.13) leads to the dependence on the degree of class imbalance reflected in the constant $C_{\mu^\infty}$ in Theorem B.1.
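The optimal transport inequality just quoted can be checked directly in one dimension, where the 1-Wasserstein distance between two equal-size empirical measures is computed by sorting (the monotone coupling is optimal on $\mathbb{R}$). A small sketch with a hypothetical Lipschitz test function:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
x = np.sort(rng.uniform(0, 1, N))
y = np.sort(rng.uniform(0, 1, N))

# 1-Wasserstein distance between two equal-size empirical measures on R:
# the sorted (monotone) coupling is optimal.
W1 = np.mean(np.abs(x - y))

f = lambda t: np.abs(np.sin(3 * t))   # Lipschitz test function, ||f||_Lip <= 3
gap = abs(f(x).mean() - f(y).mean())  # |int f d(mu_N - nu_N)|
assert gap <= 3 * W1 + 1e-12          # Kantorovich-Rubinstein duality bound
```

In the proof, the same inequality is applied on each connected component of $M$ separately, which is where the class-balance constant enters.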

A.5.1 GENERAL NOTATION

If $n \in \mathbb{N}$, we write $[n] = \{1, \dots, n\}$. We generally use bold notation $\mathbf{x}$, $\mathbf{A}$ for vectors, matrices, and operators and non-bold notation for scalars and scalar-valued functions. For a vector $x$ or a matrix $A$, we write entries as either $x_j$ or $A_{ij}$, or $(x)_j$ or $(A)_{ij}$; we occasionally index the rows or columns of $A$ similarly as $(A)_i$ or $(A)_j$, with the particular meaning made clear from context. We write $[x]_+ = \max\{x, 0\}$ for the ReLU activation function; if $x$ is a vector, we write $[x]_+$ to denote the vector given by the application of $[\,\cdot\,]_+$ to each coordinate of $x$, and we generally adopt this convention for applying scalar functions to vectors. If $x, x' \in \mathbb{R}^n$ are nonzero, we write $\angle(x, x') = \cos^{-1}\bigl( \langle x, x' \rangle / \|x\|_2 \|x'\|_2 \bigr)$ for the angle between $x$ and $x'$. The vectors $(e_i)$ denote the canonical basis for $\mathbb{R}^n$. We write $\langle x, y \rangle = \sum_i x_i y_i$ for the euclidean inner product on $\mathbb{R}^n$, and if $0 < p < +\infty$ we write $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ for the $\ell^p$ norms (when $p \ge 1$) on $\mathbb{R}^n$. We also write $\|x\|_0 = |\{ i \in [n] \mid x_i \ne 0 \}|$ and $\|x\|_\infty = \max_{i \in [n]} |x_i|$. The unit ball in $\mathbb{R}^n$ is written $B^n = \{ x \in \mathbb{R}^n \mid \|x\|_2 \le 1 \}$, and we denote its (topological) boundary, the unit sphere, as $S^{n-1}$. We reserve the notation $\|\cdot\|$ for the operator norm of an $m \times n$ matrix $A$, defined as $\|A\| = \sup_{\|x\|_2 \le 1} \|Ax\|_2$; more generally, we write $\|A\|_{\ell^p \to \ell^q} = \sup_{\|x\|_p \le 1} \|Ax\|_q$ for the corresponding induced matrix norm. For $m \times n$ matrices $A$ and $B$, we write $\langle A, B \rangle = \operatorname{tr}(A^* B)$ for the standard inner product, where $A^*$ denotes the transpose of $A$, and $\|A\|_F = \sqrt{\langle A, A \rangle}$ for the Frobenius norm of $A$. The Banach space of (equivalence classes of) real-valued measurable functions on a measure space $(X, \mu)$ satisfying $(\int_X |f|^p\, d\mu)^{1/p} < +\infty$ is written $L^p_\mu(X)$, or simply $L^p$ if the space and/or measure is clear from context; we write $\|\cdot\|_{L^p}$ for the associated norm, and $\langle \cdot, \cdot \rangle_{L^2}$ for the associated inner product when $p = 2$, with the adjoint operation denoted by $*$.
For an operator $T : L^p_\mu \to L^q_\nu$, we write $T[f]$ to denote the image of $f$ under $T$, $T^i$ to denote the operator that applies $T$ $i$ times, and $\|T\|_{L^p_\mu \to L^q_\nu} = \sup_{\|f\|_{L^p_\mu} \le 1} \|T[f]\|_{L^q_\nu}$. We use $\operatorname{Id}$ to denote the identity operator, i.e. $\operatorname{Id}[g] = g$ for every $g \in L^p_\mu$. We say that $T$ is positive if $\langle f, T[f] \rangle_{L^2} \ge 0$ for all $f \in L^2$; for example, the identity operator is positive. For an event $E$ in a probability space, we write $\mathbf{1}_E$ to denote the indicator random variable that takes the value 1 if $\omega \in E$ and 0 otherwise. If $\sigma > 0$, by $g \sim \mathcal{N}(0, \sigma^2 I)$ we mean that $g \in \mathbb{R}^n$ is distributed according to the standard i.i.d. gaussian law with variance $\sigma^2$, i.e., it admits the density $(2\pi\sigma^2)^{-n/2} \exp(-\|x\|_2^2 / (2\sigma^2))$ with respect to Lebesgue measure on $\mathbb{R}^n$; we occasionally write this equivalently as $g \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, \sigma^2)$. We use $\overset{d}{=}$ to denote the "identically distributed" equivalence relation. We use "numerical constant" and "absolute constant" interchangeably for numbers that are independent of all problem parameters. Throughout the text, unless specified otherwise we use $c, c', c'', C, C', C'', K, K', K''$, and so on to refer to numerical constants whose value may change from line to line within a proof. Numerical constants with numbered subscripts $C_1, C_2, \dots$ will have values fixed at the scope of the proof of a single result, unless otherwise specified. We generally use lower-case letters to refer to numerical constants whose value should be small, and upper case for those that should be large; we generally use $K$, $K'$ and so on to denote numerical constants involved in lower bounds on the size of parameters required for results to be valid. If $f$ and $g$ are two functions, the notation $f \lesssim g$ means that there exists a numerical constant $C > 0$ such that $f \le Cg$; the notation $f \gtrsim g$ means that there exists a numerical constant $C > 0$ such that $f \ge Cg$; and when both hold simultaneously we write $f \asymp g$.
If $f$ is a real-valued function with sufficient differentiability properties, we write both $f'$ and $\dot f$ for the derivative of $f$, and when higher derivatives are available we occasionally denote them by $f^{(n)}$, with this usage made clear in context. For a metric space $X$ and a Lipschitz function $f : X \to \mathbb{R}$, we write $\|f\|_{\mathrm{Lip}}$ to denote the minimal Lipschitz constant of $f$.

A.5.2 SUMMARY OF OPERATOR AND ERROR DEFINITIONS

We collect in this section some of the important definitions that appear throughout the main text and the appendices. We begin with the NTK-type operators that appear in our analysis. Recall from Appendix A.1 our definition for the backward features: we have $\beta^\ell_\theta(x) = \bigl( W^{L+1} P_{I_L(x)} W^L P_{I_{L-1}(x)} \cdots W^{\ell+2} P_{I_{\ell+1}(x)} \bigr)^*$, where $P_{I_\ell(x)}$ is the orthogonal projection onto the set of coordinates where the $\ell$-th activation at input $x$ is positive. "The" neural tangent kernel is defined as $\Theta(x, x') = \langle \nabla f_{\theta_0}(x), \nabla f_{\theta_0}(x') \rangle = \langle \alpha^L(x), \alpha^L(x') \rangle + \sum_{\ell=0}^{L-1} \langle \alpha^\ell(x), \alpha^\ell(x') \rangle \langle \beta^\ell(x), \beta^\ell(x') \rangle$, with corresponding operator on $L^2_{\mu^\infty}(M)$ given by $\mathbf{\Theta}[g](x) = \int_M \Theta(x, x')\, g(x')\, d\mu^\infty(x')$. As shown in Lemma B.8, this is not exactly the kernel that governs the dynamics of gradient descent: the relevant kernels in this context are defined as $\Theta^N_k(x, x') = \int_0^1 \bigl\langle \nabla f_{\theta^N_k}(x'), \nabla f_{\theta^N_k - t\tau \nabla L_{\mu_N}(\theta^N_k)}(x) \bigr\rangle\, dt$. We define operators $\mathbf{\Theta}^N_k$ on $L^2_{\mu_N}(M)$ corresponding to integration against these kernels in a manner analogous to the definition of $\mathbf{\Theta}$: $\mathbf{\Theta}^N_k[g](x) = \int_M \Theta^N_k(x, x')\, g(x')\, d\mu_N(x')$. We then move to the deterministic approximations for $\Theta$ that we develop: we define $\varphi(\nu) = \cos^{-1}\bigl( (1 - \nu/\pi)\cos\nu + (1/\pi)\sin\nu \bigr)$, which governs the angle evolution process in the initial random network, as studied in Appendix E, and write $\varphi^{(\ell)}$ to denote the $\ell$-fold composition of $\varphi$ with itself. We define $\psi_1(\nu) = \frac{n}{2} \sum_{\ell=0}^{L-1} \cos\bigl( \varphi^{(\ell)}(\nu) \bigr) \prod_{\ell'=\ell}^{L-1} \Bigl( 1 - \frac{\varphi^{(\ell')}(\nu)}{\pi} \Bigr)$, which is the "output" of our main result on concentration, Theorem B.2, and $\psi(\nu) = \frac{n}{2} \sum_{\ell=0}^{L-1} \prod_{\ell'=\ell}^{L-1} \Bigl( 1 - \frac{\varphi^{(\ell')}(\nu)}{\pi} \Bigr)$, which is at the core of the certificate construction problem. We think of $\psi$ as an analytically simpler version of $\psi_1$, with an approximation guarantee given in Lemma C.11.
Throughout these appendices, we will make use of basic properties of $\psi_1$ and $\psi$ that follow from properties of $\varphi$ without explicit reference; the source for these types of claims is Lemma E.5, which gives elementary properties of $\varphi$ (for example, that it takes values in $[0, \pi/2]$, which implies that $\psi$ and $\psi_1$ are no larger than $nL/2$). For derived estimates, we call the reader's attention to the contents of Appendix C.2.2; we will make explicit reference to these results when we need them. Although we have mentioned deterministic approximations to $\Theta$ in the main text, in these appendices we prefer to reference $\psi$ and $\psi_1$ explicitly to avoid confusion; as an exception, we use the approximation notation in Appendix C as discussed there. Our approximation for the initial prediction error is $\hat\zeta(x) = -f_\star(x) + \int_M f_{\theta_0}(x')\, d\mu^\infty(x')$, (A.5) where $f_\star$ denotes the label function and we recall that $f_{\theta_0}$ denotes the network function with the initial (random) weights. In particular, this approximates the network function by a constant, and the error by a function that is piecewise constant on $M_\pm$. This approximation is justified in Lemma D.11.
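Since the approximations above are governed entirely by the scalar map $\varphi$, they are straightforward to evaluate numerically. The sketch below computes $\varphi$ and $\psi_1$ as defined above and checks three of the basic properties just mentioned: $\varphi$ maps $[0, \pi]$ into $[0, \pi/2]$, $\psi_1(0) = nL/2$, and the normalized kernel sharpens as the depth $L$ grows (the mechanism behind the certificate construction):

```python
import numpy as np

def phi(nu):
    # phi(nu) = arccos((1 - nu/pi) cos(nu) + (1/pi) sin(nu)); clip guards rounding.
    val = (1 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return float(np.arccos(np.clip(val, -1.0, 1.0)))

def angle_iterates(nu, L):
    # [phi^(0)(nu), phi^(1)(nu), ..., phi^(L-1)(nu)]
    angles = [nu]
    for _ in range(L - 1):
        angles.append(phi(angles[-1]))
    return angles

def psi1(nu, n, L):
    # psi_1(nu) = (n/2) sum_{l} cos(phi^(l)(nu)) prod_{l'=l}^{L-1} (1 - phi^(l')(nu)/pi)
    total, tail = 0.0, 1.0
    for a in reversed(angle_iterates(nu, L)):  # accumulate tail products from l = L-1 down
        tail *= 1.0 - a / np.pi
        total += np.cos(a) * tail
    return 0.5 * n * total

# phi maps [0, pi] into [0, pi/2]; at nu = 0 the kernel takes its peak value nL/2.
assert all(0.0 <= phi(nu) <= np.pi / 2 + 1e-12 for nu in np.linspace(0, np.pi, 50))
assert abs(psi1(0.0, 2, 16) - 2 * 16 / 2) < 1e-9
# Depth sharpens the normalized kernel: off-peak mass shrinks as L grows.
assert psi1(0.5, 2, 64) / psi1(0.0, 2, 64) < psi1(0.5, 2, 8) / psi1(0.0, 2, 8)
```

The function $\psi$ is obtained by simply dropping the cosine factor in `psi1`.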

B.1 MAIN RESULTS

Theorem B.1. Let $M$ be a one-dimensional Riemannian manifold satisfying our regularity assumptions. For any $0 < \delta \le 1/e$, choose $L$ so that $L \ge C_1 \max\bigl\{ C_{\mu^\infty} \log^9(1/\delta) \log^{24}\bigl( C_{\mu^\infty} n_0 \log(1/\delta) \bigr), \kappa^2 C_\lambda \bigr\}$, let $N \ge L^{10}$, set $n = C_2 L^{99} \log^9(1/\delta) \log^{18}(L n_0)$, and fix $\tau > 0$ such that $C_3/(nL^2) \le \tau \le C_4/(nL)$. Then if there exists a function $g \in L^2_{\mu^\infty}(M)$ such that $\|\mathbf{\Theta}[g] - \zeta\|_{L^2_{\mu^\infty}(M)} \le \frac{C_5 \log(1/\delta) \log(n n_0)}{L} \min\bigl\{ \rho_{\min}^{q_{\mathrm{cert}}}, \rho_{\min}^{-q_{\mathrm{cert}}} \bigr\}$ and $\|g\|_{L^2_{\mu^\infty}(M)} \le \frac{C_6 \log(1/\delta) \log(n n_0)}{n \rho_{\min}^{q_{\mathrm{cert}}}}$, (B.1) then with probability at least $1 - \delta$ over the random initialization of the network and the i.i.d. sample from $\mu^\infty$, the parameters obtained at iteration $L^{39/44}/(n\tau)$ of gradient descent on the finite sample loss $L_{\mu_N}$ yield a classifier that separates the two manifolds. The constants $C_1, \dots, C_4 > 0$ depend only on the constants $q_{\mathrm{cert}}, C_5, C_6 > 0$; the constants $\kappa, C_\lambda$ are respectively the extrinsic curvature constant and the global regularity constant defined in Section 2.1, and the constant $C_{\mu^\infty}$ is defined as $\max\{\rho_{\min}^q, \rho_{\min}^{-q}\} (1 + \rho_{\max})^6 \bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{-11/2}$, where $q = 11 + 8 q_{\mathrm{cert}}$.

Proof. The proof is an application of Lemma B.7, with suitable instantiations of the parameters of that result; to avoid clashing with the probability parameter $\delta$ in this theorem, we use $\varepsilon$ for the parameter $\delta$ appearing in Lemma B.7. Define $C_\rho = \max\{\rho_{\min}, \rho_{\min}^{-1}\}$. We pick $q = 39/44$ and $\varepsilon = 5/47$, so that the relevant hypotheses of Lemma B.7 become (after worst-casing in the bound on $N$ somewhat for readability)
$d \ge K \log(n n_0 C_M)$,
$n \ge K \max\bigl\{ L^{99} d^9 \log^9 L, \kappa^{2/5}, (\kappa/c_\lambda)^{1/3} \bigr\}$,
$L \ge K \max\bigl\{ C_\rho^{2 q_{\mathrm{cert}}} d, \kappa^2 C_\lambda \bigr\}$,
$N \ge K \frac{C_\rho^{133/18 + (152/27) q_{\mathrm{cert}}} (1 + \rho_{\max})^{133/54}}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{19/18}}\, d^{8/3} L^9 \log^3 L$,
and the conclusion we will appeal to becomes $\mathbb{P}\Bigl[ \|\zeta^N_{L^{39/44}/(n\tau)}\|_{L^\infty(M)} \le C \frac{C_\rho^{1 + 2 q_{\mathrm{cert}}/3} (1 + \rho_{\max})^{1/2}}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{1/2}} \frac{d^{3/4} \log^{4/3} L}{L^{1/11}} \Bigr] \ge 1 - \frac{C' L e^{-cd}}{n\tau}$.
Under our choice of $\tau$ and enforcing $L \ge (2C)^{11} \frac{C_\rho^{11 + 22 q_{\mathrm{cert}}/3} (1 + \rho_{\max})^{11/2} d^{33/4} \log^{44/3} L}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{11/2}}$, (B.2) we have the equivalent result $\mathbb{P}\bigl[ \|\zeta^N_{L^{39/44}/(n\tau)}\|_{L^\infty(M)} \le 1/2 \bigr] \ge 1 - L^3 e^{-cd} \ge 1 - e^{-c'd}$, where the last bound holds when $d \ge K \log L$, which is redundant with the hypotheses on $n$ and $d$ required to use Lemma B.7. Thus, when in addition $d \ge (1/c') \log(1/\delta)$, we obtain $\mathbb{P}\bigl[ \|\zeta^N_{L^{39/44}/(n\tau)}\|_{L^\infty(M)} \le 1/2 \bigr] \ge 1 - \delta$. (B.3) Therefore to conclude, we need only argue that our choices of $n$, $N$, $L$, $d$, and $\delta$ in the theorem statement suffice to satisfy the hypotheses of Lemma B.7. We have already satisfied the conditions on $\varepsilon$ and $q$. We notice that (B.2) implies that it suffices to enforce simply $N \ge L^{10}$, and following Lemma C.4, we can bound $C_M$ as in (B.62) in the proof of Lemma B.7 by $C_M \le 1 + \frac{\operatorname{len}(M_+)}{\mu^\infty(M_+)} + \frac{\operatorname{len}(M_-)}{\mu^\infty(M_-)} \le 2 \frac{1 + \rho_{\max}}{\rho_{\min}}$. Because $n \ge L^{99}$ and $L \ge C_\rho (1 + \rho_{\max})$, we can eliminate $C_M$ from the lower bound on $d$ while paying only an extra factor of 2 in the constant. In addition, because $\kappa \ge 1$ and $C_\lambda \ge \max\{1, 1/c_\lambda\}$, we can remove the $\kappa^{2/5}$ and $(\kappa/c_\lambda)^{1/3}$ lower bounds on $n$, since they are enforced through $L$ already via the bound $L \ge K \kappa^2 C_\lambda$, worsening the absolute constant if needed. These simplifications lead us to the sufficient conditions (plus the certificate existence hypotheses)
$d \ge K \max\{ \log(1/\delta), \log(n n_0) \}$,
$n \ge K L^{99} d^9 \log^9 L$,
$L \ge K \max\Bigl\{ \frac{C_\rho^{11 + 22 q_{\mathrm{cert}}/3} (1 + \rho_{\max})^{11/2} d^{33/4} \log^{44/3} L}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{11/2}}, \kappa^2 C_\lambda \Bigr\}$,
$N \ge L^{10}$.
We ignore the condition on $N$ below, since it matches the theorem statement. When $\delta \le 1/e$, given that $n_0 \ge 3$ we have $n n_0 \ge e$ and $\max\{ \log(1/\delta), \log(n n_0) \} \le \log(1/\delta) \log(n n_0)$. For the sake of simplicity, we can also round up the fractional exponents in the lower bound on $L$.
We can eliminate $d$ from these sufficient conditions by substituting the lower bound into the conditions on $n$ and $L$, and this also implies that our conditions on certificate existence in the theorem statement suffice for the certificate existence hypothesis of Lemma B.7. Thus, we have the remaining sufficient conditions
$n \ge K L^{99} \log^9(1/\delta) \log^9(n n_0) \log^9 L$,
$L \ge K \max\Bigl\{ \frac{C_\rho^{11 + 8 q_{\mathrm{cert}}} (1 + \rho_{\max})^6 \log^9(1/\delta) \log^9(n n_0) \log^{15} L}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{11/2}}, \kappa^2 C_\lambda \Bigr\}$.
Using Lemma B.15 and choosing $L$ larger than a sufficiently large absolute constant and larger than $\log(1/\delta)$, we obtain that it suffices to enforce for $n$: $n \ge K L^{99} \log^9(1/\delta) \log^{18}(L n_0)$. In the hypotheses of the theorem, we have chosen equality $n = K L^{99} \log^9(1/\delta) \log^{18}(L n_0)$ in the last bound. This implies $\log(n n_0) \le C \log(L n_0)$, so it suffices to enforce the lower bound $L \ge K \max\Bigl\{ \frac{C_\rho^{11 + 8 q_{\mathrm{cert}}} (1 + \rho_{\max})^6 \log^9(1/\delta) \log^{24}(L n_0)}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{11/2}}, \kappa^2 C_\lambda \Bigr\}$. Defining, as in the theorem, $C_{\mu^\infty} = \frac{C_\rho^{11 + 8 q_{\mathrm{cert}}} (1 + \rho_{\max})^6}{\bigl( \min\{ \mu^\infty(M_+), \mu^\infty(M_-) \} \bigr)^{11/2}}$, and using $C_{\mu^\infty} \ge 1$, we can worsen the absolute constant $K$ in order to apply Lemma B.15 once again, obtaining the simplified condition $L \ge C K \max\bigl\{ C_{\mu^\infty} \log^9(1/\delta) \log^{24}\bigl( C_{\mu^\infty} n_0 \log(1/\delta) \bigr), \kappa^2 C_\lambda \bigr\}$. These conditions reflect what is stated in the theorem.

Theorem B.2. Let $M$ be a $d_0$-dimensional Riemannian submanifold of $S^{n_0-1}$. For any $d \ge K d_0 \log(n n_0 C_M)$, if $n \ge K' d^4 L$ then on an event of probability at least $1 - e^{-cd}$ one has $\sup_{(x, x') \in M \times M} \Bigl| \Theta(x, x') - \frac{n}{2} \sum_{\ell=0}^{L-1} \cos\bigl( \varphi^{(\ell)}(\nu) \bigr) \prod_{\ell'=\ell}^{L-1} \Bigl( 1 - \frac{\varphi^{(\ell')}(\nu)}{\pi} \Bigr) \Bigr| \le \sqrt{d^4 n L^3}$, where we write $\nu = \angle(x, x')$ in context with an abuse of notation, $c, K, K' > 0$ are absolute constants, and $C_M > 0$ depends only on the number of connected components of $M$ and their diameters and curvatures (Lemma C.4).

Proof. We have by the definition of $\Theta$: $\Theta(x, x') = \langle \alpha^L(x), \alpha^L(x') \rangle + \sum_{\ell=0}^{L-1} \langle \alpha^\ell(x), \alpha^\ell(x') \rangle \langle \beta^\ell(x), \beta^\ell(x') \rangle$.
(B.4) Under the stated hypotheses, Lemmas D.10 and D.13 give uniform control of each of the terms appearing in this expression with suitable probability to tolerate 2L + 1 union bounds, which gives simultaneous uniform control of the factors on an event E of probability at least 1 − e^{−cd}. Starting from (B.4), we can write with the triangle inequality

  | Θ(x,x′) − (n/2) Σ_{ℓ=0}^{L−1} cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) |
   ≤ |⟨α^L(x), α^L(x′)⟩| + Σ_{ℓ=0}^{L−1} | ⟨α^ℓ(x), α^ℓ(x′)⟩⟨β^ℓ(x), β^ℓ(x′)⟩ − (n/2) cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) |.  (B.5)

By the triangle inequality, we have

  | ⟨α^ℓ(x), α^ℓ(x′)⟩⟨β^ℓ(x), β^ℓ(x′)⟩ − (n/2) cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) |
   ≤ |⟨α^ℓ(x), α^ℓ(x′)⟩| · | ⟨β^ℓ(x), β^ℓ(x′)⟩ − (n/2) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) |
    + (n/2) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) · | ⟨α^ℓ(x), α^ℓ(x′)⟩ − cos φ^{(ℓ)}(ν) |.

Under the conditions on n, L, and d, we have on the event E that for each ℓ, sup_{(x,x′)∈M×M} |⟨α^ℓ(x), α^ℓ(x′)⟩| ≤ 2, so we can conclude that on E

  | ⟨α^ℓ(x), α^ℓ(x′)⟩⟨β^ℓ(x), β^ℓ(x′)⟩ − (n/2) cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) | ≤ 3√(d⁴ n L).

The conditions on n, d, and L imply that this residual is larger than that incurred by the level-L features, which is no larger than 2. Returning to (B.5), we have shown that on E

  | Θ(x,x′) − (n/2) Σ_{ℓ=0}^{L−1} cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) | ≤ C√(d⁴ n L³).

After adjusting the other absolute constants to absorb C into d, this gives the claim.

Theorem B.3 (Pointwise Version of Theorem B.2). Let M be a d₀-dimensional Riemannian submanifold of S^{n₀−1}. For any d ≥ K log n, if n ≥ K′ max{1, d⁴ L}, then for any (x,x′) ∈ M × M one has

  P[ | Θ(x,x′) − (n/2) Σ_{ℓ=0}^{L−1} cos φ^{(ℓ)}(ν) Π_{ℓ′=ℓ}^{L−1} (1 − φ^{(ℓ′)}(ν)/π) | ≤ √(d⁴ n L³) ] ≥ 1 − e^{−cd},

where we write ν = ∠(x,x′) in context with an abuse of notation, and c, K, K′ > 0 are absolute constants.

Proof. Follow the proof of Theorem B.2, but invoke the pointwise versions of the uniform concentration results used there (i.e., Lemmas D.1 and D.4) after rescaling d to absorb the log n terms.

Proposition B.4.
Let M be an r-instance of the two circles geometry studied in Appendix C.1.1, with r ≥ 1/2. For any 0 < δ ≤ 1/e, if n ≥ K L⁵ log⁴(1/δ) log⁴(Ln₀ log(1/δ)) and L ≥ K′ (1 − r²)^{−1/2}, then there exist absolute constants C₅, C₆ > 0 and a function g such that (B.1) is satisfied with the choice q_cert = 1/2 with probability at least 1 − 3δ. The constants K, K′ > 0 are absolute.

Proof. Given r ≥ 1/2 and L ≥ max{K, (π/2)(1 − r²)^{−1/2}}, we have by Lemma C.1 that there exists g such that

  ∫_M ψ∘∠(·, x′) g(x′) dµ∞(x′) = ζ̄,  with  ‖g‖_{L²_{µ∞}} ≤ (64/√π) ‖ζ̄‖_{L∞(M)} / (n ρ_min^{1/2}).  (B.6)

By this bound, the triangle inequality, the Minkowski inequality, and the fact that µ∞ is a probability measure, we have

  ‖Θ[g] − ζ‖_{L²_{µ∞}} ≤ ‖Θ − ψ∘∠‖_{L∞(M×M)} ‖g‖_{L²_{µ∞}} + ‖ζ̄ − ζ‖_{L²_{µ∞}}
   ≤ C ‖Θ − ψ∘∠‖_{L∞(M×M)} ‖ζ̄‖_{L∞(M)} / (n ρ_min^{1/2}) + ‖ζ̄ − ζ‖_{L∞(M)}.  (B.7)

An application of Theorem B.2 and Lemma C.11 gives that on an event of probability at least 1 − e^{−cd},

  ‖Θ − ψ∘∠‖_{L∞(M×M)} ≤ Cn/L

if d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ d⁴ L⁵. An application of Lemma D.11 gives

  P[ ‖ζ̄ − ζ‖_{L∞(M)} ≤ √(2d)/L ] ≥ 1 − e^{−cd}  and  P[ ‖ζ‖_{L∞(M)} ≤ √d ] ≥ 1 − e^{−cd}

as long as n ≥ K d⁴ L⁵ and d ≥ K′ d₀ log(nn₀ C_M), where we use these conditions to simplify the residual that appears in Lemma D.11. In particular, combining the previous two bounds with the triangle inequality and a union bound and then rescaling d, which worsens the constant c and the absolute constants in the preceding conditions, gives

  P[ ‖ζ̄‖_{L∞(M)} ≤ √d ] ≥ 1 − 2e^{−cd}.

Combining these bounds using a union bound and substituting into (B.7), we get that under the preceding conditions, on an event of probability at least 1 − 3e^{−cd}, we have

  ‖Θ[g] − ζ‖_{L²_{µ∞}} ≤ (C√d/L)(1 + 1/ρ_min^{1/2}) ≤ (C√d/L) max{ρ_min^{1/2}, ρ_min^{−1/2}},

where we worst-case the density constant in the second inequality; in addition, on the same event, we have by (B.6)

  ‖g‖_{L²_{µ∞}} ≤ (64/√π) √d / (n ρ_min^{1/2}).
To conclude, we simplify the preceding conditions on n and convert the parameter d into the parameter δ > 0 in order to obtain the claimed form of the result. We have in this setting d₀ = 1, and also that C_M is bounded by an absolute constant; since n₀ ≥ 3, we can thus eliminate the parameter C_M from our hypotheses by adding an extra absolute constant factor. Choosing d ≥ (1/c) log(1/δ), we obtain that the previous two bounds hold on an event of probability at least 1 − 3δ. When δ ≤ 1/e, given that n₀ ≥ 3 we have nn₀ ≥ e and max{log(1/δ), log(nn₀)} ≤ log(1/δ) log(nn₀), so that it suffices to enforce the requirement d ≥ K log(1/δ) log(nn₀) for a certain absolute constant K > 0. We can then substitute this lower bound on d into the two certificate bounds above to obtain the form claimed in (B.1) with q_cert = 1/2. For the hypothesis on n, we substitute this lower bound on d into the condition on n to obtain the sufficient condition

  n ≥ K′ L⁵ log⁴(1/δ) log⁴(nn₀).

Using Lemma B.15 and possibly worsening absolute constants, we then get that it suffices to enforce

  n ≥ K′ L⁵ log⁴(1/δ) log⁴(Ln₀ log(1/δ)),

which is the hypothesis in the result.

Theorem B.5. There exist absolute constants c, C, K, K′ > 0 such that for any d ≥ K n₀ log n, if n ≥ K′ d⁴ L, then on an event of probability at least 1 − e^{−cd} the natural extension of f_{θ₀} to ℝ^{n₀} is 3√d-Lipschitz.

Proof. The proof is a simple application of Lemma B.17, which (because f_{θ₀} is 1-nonnegatively homogeneous, and so are all its intermediate feature maps α_{θ₀}(x)) implies that it suffices to control the Lipschitz constants of the maps and bound them on the unit sphere, together with Lemmas D.11 and D.12. In particular, for any d ≥ K n₀ log n and any n ≥ K′ d⁴ L, we have that there exists an event of probability at least 1 − e^{−cd} on which

  ‖f_{θ₀}‖_{L∞(S^{n₀−1})} ≤ √d,  and  ‖f_{θ₀}‖_{Lip(S^{n₀−1})} ≤ √d.
Applying Lemma B.17, it follows that f_{θ₀} : ℝ^{n₀} → ℝ is 3√d-Lipschitz on an event of probability at least 1 − e^{−cd}.
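The limiting kernel appearing in Theorem B.2 can be evaluated numerically once the angle evolution maps φ^{(ℓ)} are specified. The sketch below is purely illustrative: it assumes the standard arc-cosine angle evolution for ReLU features, cos φ^{(ℓ+1)} = (sin φ^{(ℓ)} + (π − φ^{(ℓ)}) cos φ^{(ℓ)})/π (the paper's precise definition of φ^{(ℓ)} is given in its appendices), and then sums the skeleton formula from the theorem statement.

```python
import numpy as np

def relu_angle_step(phi):
    # One layer of the (assumed) ReLU angle evolution map: the angle between
    # post-activation features of two inputs at angle phi evolves as
    #   cos(phi') = (sin(phi) + (pi - phi) * cos(phi)) / pi.
    c = (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

def ntk_skeleton(nu, L, n):
    # The limiting kernel of Theorem B.2 at angle nu:
    #   (n/2) * sum_{l=0}^{L-1} cos(phi^(l)) * prod_{l'=l}^{L-1} (1 - phi^(l')/pi).
    phis = [nu]
    for _ in range(L - 1):
        phis.append(relu_angle_step(phis[-1]))
    phis = np.array(phis)                       # phi^(0), ..., phi^(L-1)
    # suffix products of (1 - phi^(l')/pi) over l' = l, ..., L-1
    tail = np.cumprod((1.0 - phis / np.pi)[::-1])[::-1]
    return (n / 2.0) * np.sum(np.cos(phis) * tail)
```

Since the angle map contracts toward 0 with depth, the kernel concentrates increasingly sharply around ν = 0 as L grows, which is one way to visualize the role of depth as a fitting resource.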

B.2 SUPPORTING RESULTS ON DYNAMICS

Lemma B.6 (Nominal). Suppose C_err, C_cert, q_cert > 0 are absolute constants. Then there exist absolute constants c, c′, C, C′, C″ > 0 and absolute constants K, K′, K″ > 0 such that for any d ≥ K d₀ log(nn₀ C_M) and any 1/2 ≤ q ≤ 1, if n ≥ K′ d⁴ L⁵, if L ≥ K″ d C_ρ^{2q_cert}, if additionally there exists g ∈ L²_{µ∞}(M) satisfying

  ‖Θ[g] − ζ‖_{L²_{µ∞}(M)} ≤ C_err C_ρ^{q_cert} √d / L;  ‖g‖_{L²_{µ∞}(M)} ≤ C_cert ρ_min^{−q_cert} √d / n,

and τ > 0 is chosen such that τ ≤ c′/(nL), then one has

  P[ sup_{0 ≤ k ≤ L^q/(nτ)} ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ √d ] ≥ 1 − e^{−cd},

and in addition

  P[ sup_{C√d/(nτ ρ_min^{q_cert}) ≤ k ≤ L^q/(nτ)} ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ C′ C_ρ^{q_cert} √d log L / (nkτ) ] ≥ 1 − e^{−cd}.

Moreover, one has

  P[ sup_{0 ≤ k ≤ L^q/(nτ)} Σ_{s=0}^{k} ‖ζ^∞_s‖_{L²_{µ∞}(M)} ≤ C_ρ^{2q_cert} C″ d log² L / (nτ) ] ≥ 1 − e^{−cd}.

The constant C_ρ = max{ρ_min, ρ_min^{−1}}.

Proof. We will combine Lemma B.12 with various probabilistic results to obtain a simple final form for the bound from this result. Invoking Lemma B.12, we can assert that for any step size τ > 0 satisfying

  τ < 1 / ‖Θ‖_{L²_{µ∞}(M) → L²_{µ∞}(M)},  (B.8)

and for any k satisfying

  kτ ≥ 3e² ‖g‖_{L²_{µ∞}(M)} / ‖ζ̄‖_{L∞(M)},  (B.9)

the population dynamics satisfy

  ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ √3 ‖Θ[g] − ζ‖_{L²_{µ∞}(M)} − (3 ‖g‖_{L²_{µ∞}(M)} / (kτ)) log( 3 ‖g‖_{L²_{µ∞}(M)} / (2 ‖ζ̄‖_{L∞(M)} kτ) ).  (B.10)

We state the bounds we will apply to simplify this expression. An application of Lemma D.11 gives

  P[ ‖ζ̄ − ζ‖_{L∞(M)} ≤ √(2d)/L ] ≥ 1 − e^{−cd}  (B.11)

and

  P[ ‖ζ‖_{L∞(M)} ≤ √d ] ≥ 1 − e^{−cd}  (B.12)

as long as n ≥ K d⁴ L⁵ and d ≥ K′ d₀ log(nn₀ C_M), where we use these conditions to simplify the residual that appears in the version of (B.11) quoted in Lemma D.11. In particular, combining (B.11) and (B.12) with the triangle inequality and a union bound and then rescaling d, which worsens the constant c and the absolute constants in the preceding conditions, gives

  P[ ‖ζ̄‖_{L∞(M)} ≤ √d ] ≥ 1 − 2e^{−cd}.
(B.13) In addition, we can write using the triangle inequality

  ‖ζ̄‖_{L∞(M)} ≥ ‖ζ‖_{L∞(M)} − ‖ζ̄ − ζ‖_{L∞(M)},

and

  ‖ζ‖_{L∞(M)} = sup_{x∈M} | f(x) − ∫_M f_{θ₀}(x′) dµ∞(x′) | = max{ | ∫_M f_{θ₀}(x′) dµ∞(x′) − 1 |, | ∫_M f_{θ₀}(x′) dµ∞(x′) + 1 | } ≥ 1,

so that, by (B.11), we have if L ≥ 2√(2d)

  P[ ‖ζ̄‖_{L∞(M)} ≥ 1/2 ] ≥ 1 − e^{−cd}.  (B.14)

Because µ∞ is a probability measure, Jensen's inequality, the Schwarz inequality, and the triangle inequality give

  ‖Θ‖_{L²_{µ∞}(M) → L²_{µ∞}(M)} ≤ sup_{(x,x′)∈M×M} |Θ(x,x′)| ≤ sup_{(x,x′)∈M×M} |Θ(x,x′) − ψ₁∘∠(x,x′)| + sup_{(x,x′)∈M×M} |ψ₁∘∠(x,x′)|,

and an application of Theorem B.2 and Lemma E.5 then gives that on an event E of probability at least 1 − e^{−cd},

  ‖Θ‖_{L²_{µ∞}(M) → L²_{µ∞}(M)} ≤ CnL,  (B.15)

provided d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ d⁴ L; in particular, (B.8) holds under our hypotheses. Next, on E, we write using the decreasingness of x ↦ −log x and (B.12)

  −(3‖g‖_{L²_{µ∞}(M)}/(kτ)) log( 3‖g‖_{L²_{µ∞}(M)} / (2‖ζ̄‖_{L∞(M)} kτ) ) ≤ −(3‖g‖_{L²_{µ∞}(M)}/(kτ)) log( 3‖g‖_{L²_{µ∞}(M)} / (2kτ√d) ) = 2√d · ( −x log x ),  (B.17)

where x = 3‖g‖_{L²_{µ∞}(M)} / (2kτ√d). By the hypothesis on g, we have on E

  ‖g‖_{L²_{µ∞}(M)} ≤ C ρ_min^{−q_cert} √d / n,  (B.18)

and so it follows that on E

  x = 3‖g‖_{L²_{µ∞}(M)} / (2kτ√d) ≤ C / (nkτ ρ_min^{q_cert}).

The function x ↦ −x log x is strictly increasing on [0, e^{−1}], so when k is chosen such that

  Ce / (nτ ρ_min^{q_cert}) ≤ k,  (B.19)

we have on E, by (B.17),

  −(3‖g‖_{L²_{µ∞}(M)}/(kτ)) log( 3‖g‖_{L²_{µ∞}(M)} / (2‖ζ̄‖_{L∞(M)} kτ) ) ≤ ( C√d / (nkτ ρ_min^{q_cert}) ) log( C^{−1} nkτ ρ_min^{q_cert} ).  (B.20)

Additionally, in the context of the condition (B.9), notice that by (B.14) and (B.18), on E we have

  3e² ‖g‖_{L²_{µ∞}(M)} / (τ ‖ζ̄‖_{L∞(M)}) ≤ Ce √d / (nτ ρ_min^{q_cert}),

so that given d ≥ 1, we have that the choice

  k ≥ Ce √d / (nτ ρ_min^{q_cert})  (B.21)

implies both conditions (B.9) and (B.19).
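The population dynamics bounded in this lemma are linear kernel gradient descent: ζ^∞_{k+1} = (Id − τΘ)[ζ^∞_k], with step size below the inverse operator norm as in (B.8). A toy discretized sketch of this mechanism follows; the rank structure, the Gaussian test kernel, and all constants are illustrative assumptions, not the quantities of the paper.

```python
import numpy as np

# Toy discretization of the population dynamics zeta_{k+1} = (Id - tau*Theta) zeta_k,
# with Theta a PSD Gram matrix standing in for the NTK integral operator.
rng = np.random.default_rng(0)
m = 200                                         # discretization points on M
X = rng.standard_normal((m, 5))
Theta = np.exp(-0.5 * np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
Theta += 1e-6 * np.eye(m)                       # PSD jitter

zeta = np.sign(rng.standard_normal(m))          # +/-1 labels minus zero initial predictor
tau = 0.9 / np.linalg.eigvalsh(Theta).max()     # step size below 1 / ||Theta||

errs = []
z = zeta.copy()
for k in range(500):
    errs.append(np.linalg.norm(z) / np.sqrt(m)) # discretized L2(mu) error norm
    z = z - tau * Theta @ z                     # one population gradient descent step

# For tau < 1/||Theta|| and Theta PSD, the error norm is nonincreasing in k,
# mirroring the uniform sqrt(d) bound in the first conclusion of the lemma.
assert all(a >= b - 1e-12 for a, b in zip(errs, errs[1:]))
```

The fast decay happens along well-aligned eigendirections of Θ, while the certificate hypothesis on g in the lemma guarantees that the target is well aligned enough for the 1/(nkτ)-type decay rate.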
We can simplify (B.20) using the hypothesis kτ ≤ L^q/n with 1/2 ≤ q ≤ 1: we get

  nkτ ρ_min^{q_cert} / C ≤ L^q ρ_min^{q_cert} / C ≤ L^{1+q},

where the last inequality requires L ≥ ρ_min^{q_cert}/C, and this implies

  −(3‖g‖_{L²_{µ∞}(M)}/(kτ)) log( 3‖g‖_{L²_{µ∞}(M)} / (2‖ζ̄‖_{L∞(M)} kτ) ) ≤ C√d log L / (nkτ ρ_min^{q_cert}).  (B.22)

The conditions we need to satisfy on kτ can be stated together as

  Ce√d / (n ρ_min^{q_cert}) ≤ kτ ≤ L^q / n,

and it is possible to satisfy these conditions simultaneously as long as L ≥ (Ce√d / ρ_min^{q_cert})^{1/q}. From q ≥ 1/2 we obtain the upper bound C²e²d / ρ_min^{2q_cert} for the quantity on the RHS of this inequality; it suffices to choose L larger than this upper bound instead. The other simplifications are easier: using the assumption on the norm of Θ[g] − ζ, we have

  ‖Θ[g] − ζ‖_{L²_{µ∞}(M)} ≤ C C_ρ^{q_cert} √d / L.

Worst-casing terms using our hypotheses on d and L to obtain a simplified bound, on E we have thus shown that when (B.21) is satisfied, we have

  ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ C C_ρ^{q_cert} √d ( 1/L + log L/(nkτ) ).

We have

  1/L ≤ log L/(nkτ)  ⟺  kτ ≤ L log L / n,

which is implied by the hypothesis kτ ≤ L^q/n as long as L ≥ e. So we can simplify to

  ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ C C_ρ^{q_cert} √d log L / (nkτ).

We also need a bound that works for k that do not satisfy (B.21). From the update equation for the dynamics in the proof of Lemma B.12 and the choice of τ, we also have

  ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ ‖ζ‖_{L²_{µ∞}(M)} ≤ √d,

where the last bound is valid on E.
Finally, we can obtain the claimed sum bound by calculating with our 'small-k' and 'large-k' bounds. Writing k₀ = ⌈C√d/(nτ ρ_min^{q_cert})⌉,

  Σ_{s=0}^{k} ‖ζ^∞_s‖_{L²_{µ∞}(M)} = Σ_{s=0}^{k₀−1} ‖ζ^∞_s‖_{L²_{µ∞}(M)} + Σ_{s=k₀}^{k} ‖ζ^∞_s‖_{L²_{µ∞}(M)}
   ≤ √d ( 1 + C√d/(nτ ρ_min^{q_cert}) ) + ( C′ C_ρ^{q_cert} √d log L/(nτ) ) Σ_{s=k₀}^{k} 1/s
   ≤ Cd/(nτ ρ_min^{q_cert}) + ( C′ C_ρ^{q_cert} √d log L/(nτ) ) ( nτ ρ_min^{q_cert}/(C√d) + ∫_{k₀}^{k} ds/s )
   ≤ Cd/(nτ ρ_min^{q_cert}) + C′ max{ρ_min^{2q_cert}, 1} log L + C″ C_ρ^{q_cert} √d log² L/(nτ),

where the second inequality uses standard estimates for the harmonic numbers and C√d/(nτ ρ_min^{q_cert}) ≥ 1, which follows from τ ≤ c′/(nL), d ≥ 1, and L ≥ K ρ_min^{q_cert} for a suitable absolute constant K; and the third inequality integrates and simplifies, using kτ ≤ L/n and again d ≥ 1 and L ≥ C ρ_min^{q_cert}. Worst-casing constants and using nτ ≤ 1, we simplify this last bound to

  Σ_{s=0}^{k} ‖ζ^∞_s‖_{L²_{µ∞}(M)} ≤ C_ρ^{2q_cert} C d log² L / (nτ).

To see that the conditions on L in the statement of the result suffice, note that we have to satisfy (say) L ≥ K ρ_min^{q_cert} and L ≥ K′ ρ_min^{−q_cert}; the first of these lower bounds is tighter when ρ_min ≥ 1, and the second when ρ_min < 1, and so it suffices to require L ≥ K ρ_min^{2q_cert} and L ≥ K′ ρ_min^{−2q_cert} instead.

Lemma B.7 (Nominal to Finite). Let d₀ = 1, and suppose C_err, C_cert, q_cert > 0 are absolute constants.
Then there exist absolute constants c, c′, C, C′, C″ > 0 and absolute constants K, K′, K″, K‴ > 0 such that for any d ≥ K d₀ log(nn₀ C_M), any 1/2 ≤ q < 1, and any 0 < δ ≤ 1, if L ≥ K′ max{C_ρ^{2q_cert} d, κ² C_λ}, if

  n ≥ K″ max{ e^{252/δ} L^{60+44q} d⁹ log⁹ L, κ^{2/5}, (κ/c_λ)^{1/3} },

if

  N^{1/(2+δ)} ≥ K‴ C_ρ^{7/2+8q_cert/3} (1 + ρ_max)^{7/6} e^{119/(3δ)} d^{5/4} L^{5/2+2q} log L / min{µ∞(M₊)^{1/2}, µ∞(M₋)^{1/2}},

and if additionally there exists g ∈ L²_{µ∞}(M) satisfying

  ‖Θ[g] − ζ‖_{L²_{µ∞}(M)} ≤ C_err C_ρ^{q_cert} √d / L;  ‖g‖_{L²_{µ∞}(M)} ≤ C_cert ρ_min^{−q_cert} √d / n,

and τ > 0 is chosen such that τ ≤ c′/(nL), then one has generalization in L²_{µ∞}(M):

  P[ ‖ζ^N_{L^q/(nτ)}‖_{L²_{µ∞}(M)} ≤ C C_ρ^{q_cert} √d log L / L^q ] ≥ 1 − C′ L e^{−cd} / (nτ),

and in addition, one has generalization in L∞(M):

  P[ ‖ζ^N_{L^q/(nτ)}‖_{L∞(M)} ≤ C″ C_ρ^{1+2q_cert/3} (1 + ρ_max)^{1/2} e^{14/(3δ)} d^{3/4} log^{4/3} L / ( min{µ∞(M₊), µ∞(M₋)}^{1/2} L^{(4q−3)/6} ) ] ≥ 1 − C′ L e^{−cd} / (nτ).

The constant C_ρ = max{ρ_min, ρ_min^{−1}}.

Proof. The proof controls the L∞ norm of the error evaluated along the finite sample dynamics using an interpolation inequality for Lipschitz functions on an interval (Lemma B.14), which relates the L∞ norm to a certain combination of the predictor's Lipschitz constant and its L²_{µ∞} norm. We can control these two quantities at time zero using our measure concentration results; to control them for larger times 0 < k ≤ L^q/(nτ), we set up a system of coupled 'discrete integral equations' for the generalization error of the finite sample predictor and the Lipschitz constant of the finite sample predictor, and use the fact that kτ is not large to argue by induction that not much blow-up can occur.
Along the way, we control the generalization error of the finite sample predictor by linking it to the generalization error of the nominal predictor as controlled in Lemma B.6; the residual that arises is shown to be small by applying Corollary B.11 and applying basic results from optimal transport theory adapted to our setting, encapsulated in Lemmas B.13 and B.16. To begin, we will lay out the probabilistic bounds we will rely on for simplifications, so that the rest of the proof can proceed without interruption. We will want to satisfy

  τ < 1 / max{ ‖Θ^{µ_N}‖_{L²_{µ_N}(M) → L²_{µ_N}(M)}, ‖Θ^{µ∞}‖_{L²_{µ∞}(M) → L²_{µ∞}(M)} },  (B.23)

following the notation of Lemma B.10. Using Jensen's inequality, the Schwarz inequality, and the triangle inequality, we have for • ∈ {N, ∞}

  ‖Θ^{µ_•}‖_{L²_{µ_•}(M) → L²_{µ_•}(M)} = sup_{‖g‖_{L²_{µ_•}(M)} ≤ 1} ‖ ∫_M Θ(·, x′) g(x′) dµ_•(x′) ‖_{L²_{µ_•}(M)}
   ≤ ‖g‖_{L¹_{µ_•}(M)} sup_{(x,x′)∈M×M} |Θ(x,x′)|
   ≤ sup_{(x,x′)∈M×M} |Θ(x,x′) − ψ₁∘∠(x,x′)| + sup_{(x,x′)∈M×M} |ψ₁∘∠(x,x′)|,  (B.24)

where the notation ψ₁ follows the definition in Appendix C.2.2. The first term in (B.24) can be controlled using Theorem B.2: we obtain that on an event of probability at least 1 − e^{−cd},

  ‖Θ − ψ₁∘∠‖_{L∞(M×M)} ≤ √(d⁴ n L³)  (B.25)

if d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ d⁴ L. The second term in (B.24) can be controlled using the triangle inequality, Lemma E.5, and the definition of ψ₁: we obtain that it is no larger than nL/2. Combining these two bounds, we have on an event of probability at least 1 − e^{−cd}

  max{ ‖Θ^{µ_N}‖_{L²_{µ_N}(M) → L²_{µ_N}(M)}, ‖Θ^{µ∞}‖_{L²_{µ∞}(M) → L²_{µ∞}(M)} } ≤ CnL  (B.26)

provided d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ d⁴ L. By Lemma B.6, we have

  P[ sup_{C√d/(nτ ρ_min^{q_cert}) ≤ k ≤ L^q/(nτ)} ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ C′ C_ρ^{q_cert} √d log L / (nkτ) ] ≥ 1 − C″ L e^{−cd} / (nτ)  (B.27)

and

  P[ sup_{0 ≤ k ≤ L^q/(nτ)} Σ_{s=0}^{k} ‖ζ^∞_s‖_{L²_{µ∞}(M)} ≤ C_ρ^{2q_cert} C″ d log² L / (nτ) ] ≥ 1 − C″ L e^{−cd} / (nτ)  (B.28)

provided d ≥ K d₀ log(nn₀ C_M), 1/2 ≤ q < 1, n ≥ K′ d⁴ L⁵, and L ≥ K″ C_ρ^{2q_cert} d.
We have by Lemmas B.6 and B.10, a union bound with (B.26), and our condition on τ that

  P[ sup_{0 ≤ k ≤ L^q/(nτ)} ‖ζ^∞_k‖_{L²_{µ∞}(M)} ≤ √d  and  sup_{0 ≤ k ≤ L^q/(nτ)} ‖ζ^N_k‖_{L²_{µ_N}(M)} ≤ √d ] ≥ 1 − C L e^{−cd} / (nτ)  (B.29)

as long as d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ L^{48+20q} d⁹ log⁹ L, where we used our conditions on τ and q to obtain that L^q/(nτ) ≥ 1 and to simplify the probability bound; and, following the notation of Corollary B.11, we have by this result (again under our condition on τ and a union bound) that there is an event of probability at least 1 − C L e^{−cd}/(nτ) on which

  ∆^N_{L^q/(nτ) − 1} ≤ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12}  (B.30)

under the previous conditions on n and d. In addition, applying Lemma D.12 and a union bound gives that on an event of probability at least 1 − C e^{−cd},

  max{ ‖ζ̄‖_{Lip(M₊)}, ‖ζ̄‖_{Lip(M₋)} } ≤ √d  (B.31)

provided d ≥ K d₀ log(nn₀ C_M) and n ≥ K′ max{d⁴ L, (κ/c_λ)^{1/3}, κ^{2/5}}. Finally, we have by Lemma B.13 that for any 0 < δ ≤ 1,

  P[ for all f ∈ Lip(M):  | ∫_M f(x) dµ∞(x) − ∫_M f(x) dµ_N(x) | ≤ 2 ‖f‖_{L∞(M)} √d / N + e^{14/δ} C_{µ∞,M} √d max_{•∈{+,−}} ‖f‖_{Lip(M_•)} / N^{1/(2+δ)} ] ≥ 1 − 8 e^{−d},  (B.32)

as long as d ≥ 1 and N ≥ 2√d / min{µ∞(M₊), µ∞(M₋)}. We let E(q, δ) denote the intersection of the events appearing in the bounds (B.25) to (B.32); by a union bound and the previous observation that L^q/(nτ) ≥ 1, we have

  P[E] ≥ 1 − C′ L e^{−cd} / (nτ).

In the sequel, we will use the events defining E to simplify our residuals without explicitly referencing that our bounds hold only on E, to save time. We start from the dynamics update equations given by Lemma B.8, which we use to write

  ζ^∞_k − ζ^N_k = (Id − τΘ)[ζ^∞_{k−1} − ζ^N_{k−1}] + τ Θ^N_{k−1}[ζ^N_{k−1}] − τ Θ[ζ^N_{k−1}],

where Θ is defined as in Lemma B.12. Under the choice of τ and positivity of Θ (Lemma B.9), we apply the triangle inequality and a telescoping series with the common initial conditions to obtain

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ τ Σ_{s=0}^{k−1} ‖ Θ^N_s[ζ^N_s] − Θ[ζ^N_s] ‖_{L²_{µ∞}(M)}.
(B.33) We can write

  Θ^N_s[ζ^N_s](x) = ∫_M Θ^N_s(x,x′) ζ^N_s(x′) dµ_N(x′)
   = ∫_M ( Θ^N_s(x,x′) − ψ₁∘∠(x,x′) ) ζ^N_s(x′) dµ_N(x′) + ∫_M ψ₁∘∠(x,x′) ζ^N_s(x′) dµ_N(x′),

and analogously

  Θ[ζ^N_s](x) = ∫_M ( Θ(x,x′) − ψ₁∘∠(x,x′) ) ζ^N_s(x′) dµ∞(x′) + ∫_M ψ₁∘∠(x,x′) ζ^N_s(x′) dµ∞(x′).  (B.34)

Differencing these expressions and simplifying with (B.25), (B.29) and (B.30), we obtain

  ‖ Θ^N_s[ζ^N_s] − Θ[ζ^N_s] ‖_{L²_{µ∞}(M)} ≤ 2 ( n^{11} L^{48+8q} d^{15} log⁹ L )^{1/12} + ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + ‖ ∫_M ψ₁∘∠(·, x′) ζ^N_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)}.  (B.35)

With this last bound and (B.34), we can use kτ ≤ L^q/n to simplify (B.33) to

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C ( L^{60+32q} d^{15} log⁹ L / n )^{1/12}
   + τ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + τ Σ_{s=0}^{k−1} ‖ ∫_M ψ₁∘∠(·, x′) ζ^N_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)}.  (B.36)

To control the remaining term in (B.36), we split the error ζ^N_s into a Lipschitz component, whose evolution is governed by the nominal kernel ψ₁∘∠, and a nonsmooth component, which is small in L∞. Formally, we define Θ_nom : L²_{µ_N}(M) → L²_{µ_N}(M) by

  Θ_nom[g](x) = ∫_M ψ₁∘∠(x,x′) g(x′) dµ_N(x′),

and use the update equation from Lemma B.8 to write

  ζ^N_s = ζ − τ Σ_{i=0}^{s−1} Θ^N_i[ζ^N_i]
   = ( ζ − τ Σ_{i=0}^{s−1} Θ_nom[ζ^N_i] ) + ( τ Σ_{i=0}^{s−1} (Θ_nom − Θ^N_i)[ζ^N_i] )
   =: ζ^{N,Lip}_s + δ^N_s,

so that ζ^N_s = ζ^{N,Lip}_s + δ^N_s, with ζ^{N,Lip}_0 = ζ and δ^N_0 = 0. It is straightforward to control δ^N_s in L∞: we have (as usual) by the triangle inequality, Jensen's inequality, and the Schwarz inequality

  ‖δ^N_s‖_{L∞(M)} ≤ τ Σ_{i=0}^{s−1} ∫_M ‖ψ₁∘∠(·, x′) − Θ^N_i(·, x′)‖_{L∞(M)} |ζ^N_i(x′)| dµ_N(x′) ≤ τ Σ_{i=0}^{s−1} ‖ψ₁∘∠ − Θ^N_i‖_{L∞(M×M)} ‖ζ^N_i‖_{L²_{µ_N}(M)},

and then the triangle inequality together with (B.25), (B.29) and (B.30) yields

  ‖δ^N_s‖_{L∞(M)} ≤ sτ √d ( √(d⁴ n L³) + ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} ) ≤ sτ √d ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12},  (B.37)

where the second inequality applies the same simplifications that led us to (B.35). The triangle inequality gives

  ‖ ∫_M ψ₁∘∠(·, x′) δ^N_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)} ≤ Σ_{•∈{N,∞}} ‖ ∫_M ψ₁∘∠(·, x′) δ^N_s(x′) dµ_•(x′) ‖_{L²_{µ∞}(M)},

and simplifying as usual using Jensen's inequality and the Hölder inequality, we obtain

  ‖ ∫_M ψ₁∘∠(·, x′) δ^N_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)} ≤ nL ‖δ^N_s‖_{L∞(M)} ≤ sτ ( n^{23} L^{60+8q} d^{15} log⁹ L )^{1/12},

where the last bound uses (B.37). Then using the triangle inequality and kτ ≤ L^q/n to simplify in (B.36), we obtain

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C ( L^{60+32q} d^{15} log⁹ L / n )^{1/12}
   + τ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + τ Σ_{s=0}^{k−1} ‖ ∫_M ψ₁∘∠(·, x′) ζ^{N,Lip}_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)}.  (B.38)

To simplify the remaining term in (B.38), we aim to apply (B.32); to do this we will need to justify the notation and establish that ζ^{N,Lip}_s ∈ Lip(M) regardless of the random sample from µ∞ and the random instance of the weights. Because ζ^{N,Lip}_s is a sum of functions, we can bound its minimal Lipschitz constant by the sum of bounds on the Lipschitz constants of each summand. We always have, for either • ∈ {+, −},

  ‖ζ^{N,Lip}_s‖_{Lip(M_•)} ≤ ‖ζ‖_{Lip(M_•)} + τ Σ_{i=0}^{s−1} ‖ ∫_M ψ₁∘∠(·, x′) ζ^N_i(x′) dµ_N(x′) ‖_{Lip}.  (B.39)

We note that because the ReLU [·]₊ is 1-Lipschitz as a map on ℝ^n, we have

  ‖ζ‖_{Lip(M_•)} ≤ ‖W^{L+1}‖₂ Π_{ℓ=1}^{L} ‖W^ℓ‖ < +∞,

so we need only develop a Lipschitz property for the summands in the second term of (B.39). To do this, we will start by showing that t ↦ ψ₁∘cos^{−1}⟨γ_•(t), x′⟩ is absolutely continuous for each x′. Continuity is immediate. The only obstruction to differentiability comes from the inverse cosine, which fails to be differentiable at ±1; because M ⊂ S^{n₀−1}, we have ⟨γ_•(t), x′⟩ = ±1 only if γ_•(t) = ±x′, and because the γ_• are simple curves, this shows that there are at most two points of nondifferentiability in [0, len(M_•)]. At points of differentiability, we calculate using the chain rule the derivative

  t ↦ −ψ₁′∘cos^{−1}⟨γ_•(t), x′⟩ · ⟨ γ̇_•(t) / √(1 − ⟨γ_•(t), x′⟩²), x′ ⟩,

and because γ_• is a sphere curve, it holds that (I − γ_•(t)γ_•(t)*) γ̇_•(t) = γ̇_•(t) for all t, whence by Cauchy–Schwarz

  | ⟨ γ̇_•(t) / √(1 − ⟨γ_•(t), x′⟩²), x′ ⟩ | = | ⟨ (I − γ_•(t)γ_•(t)*) x′ / √(1 − ⟨γ_•(t), x′⟩²), γ̇_•(t) ⟩ | ≤ ‖(I − γ_•(t)γ_•(t)*) x′‖₂ / √(1 − ⟨γ_•(t), x′⟩²) ≤ 1,  (B.40)

where we also used that the γ_• are unit-speed curves. In particular, the derivative is bounded, hence integrable on [0, len(M_•)], and so an application of (Cohn, 2013, Theorem 6.3.11) establishes that t ↦ ψ₁∘cos^{−1}⟨γ_•(t), x′⟩ is absolutely continuous, with the expansion

  ψ₁∘cos^{−1}⟨γ_•(t), x′⟩ − ψ₁∘cos^{−1}⟨γ_•(t″), x′⟩ = ∫_{t″}^{t} −ψ₁′∘cos^{−1}⟨γ_•(t′), x′⟩ ⟨ γ̇_•(t′) / √(1 − ⟨γ_•(t′), x′⟩²), x′ ⟩ dt′,

which gives an avenue to establish Lipschitz estimates for t ↦ ψ₁∘cos^{−1}⟨γ_•(t), x′⟩.
Because x′ ↦ ζ^N_i(x′) is continuous and i ≤ s ≤ k ≤ L^q/(nτ) < +∞, an application of Fubini's theorem enables us to also use this result to obtain Lipschitz estimates for the summands examined in (B.39), to wit

  ‖ ∫_M ψ₁∘∠(·, x′) ζ^N_i(x′) dµ_N(x′) ‖_{Lip} ≤ sup_{x∈M} ∫_M |ψ₁′∘∠(x,x′)| |ζ^N_i(x′)| dµ_N(x′) ≤ ‖ζ^N_i‖_{L²_{µ_N}(M)} sup_{x∈M} ( ∫_M (ψ₁′∘∠(x,x′))² dµ_N(x′) )^{1/2},  (B.41)

after using the bound (B.40) in the first inequality and the Schwarz inequality for the second. Before proceeding with further simplifications, we note that the C² property of ψ₁, continuity of ζ^N_i, boundedness of i, and compactness of M let us assert using (B.41) and (B.39) that ζ^{N,Lip}_s ∈ Lip(M) whether or not we are working on the event E. Continuing, we develop a bound for the RHS of (B.41) that is valid on E. Using the triangle inequality and the Minkowski inequality, we have for the second factor on the RHS of the last bound in (B.41)

  sup_{x∈M} ( ∫_M (ψ₁′∘∠(x,x′))² dµ_N(x′) )^{1/2} ≤ sup_{x∈M} | ∫_M (ψ₁′∘∠(x,x′))² ( dµ_N(x′) − dµ∞(x′) ) |^{1/2} + sup_{x∈M} ( ∫_M (ψ₁′∘∠(x,x′))² dµ∞(x′) )^{1/2}.

For the first term, (B.32) gives

  sup_{x∈M} | ∫_M (ψ₁′∘∠(x,x′))² ( dµ_N(x′) − dµ∞(x′) ) | ≤ C n² L⁴ √d / N + e^{14/δ} C_{µ∞,M} C′ n² L⁵ √d / N^{1/(2+δ)},

so that

  sup_{x∈M} | ∫_M (ψ₁′∘∠(x,x′))² ( dµ_N(x′) − dµ∞(x′) ) |^{1/2} ≤ C (1 + C_{µ∞,M})^{1/2} e^{7/δ} n L^{5/2} d^{1/4} / N^{1/(4+2δ)},

and for the second term,

  sup_{x∈M} ( ∫_M (ψ₁′∘∠(x,x′))² dµ∞(x′) )^{1/2} ≤ C n L² sup_{x∈M_±} ( ∫_M dµ∞(x′) / (1 + (L/π)∠(x,x′))² )^{1/2} ≤ C n L^{3/2} ρ_max^{1/2} ( len(M₊) + len(M₋) )^{1/2} ≤ C ρ_max^{1/2} C_{µ∞,M}^{1/2} n L^{3/2}.

Combining these bounds in (B.41), we obtain

  ‖ ∫_M ψ₁∘∠(·, x′) ζ^N_i(x′) dµ_N(x′) ‖_{Lip} ≤ C ‖ζ^N_i‖_{L²_{µ_N}(M)} ( (1 + C_{µ∞,M})^{1/2} e^{7/δ} n L^{5/2} d^{1/4} / N^{1/(4+2δ)} + ρ_max^{1/2} C_{µ∞,M}^{1/2} n L^{3/2} )
   ≤ C ‖ζ^N_i‖_{L²_{µ_N}(M)} (1 + C_{µ∞,M})^{1/2} e^{7/δ} (1 + ρ_max)^{1/2} d^{1/4} n L^{3/2},  (B.45)

where in the second line we used N ≥ L^{4+2δ}. Plugging (B.45) into (B.39) and applying in addition (B.31), we get

  ‖ζ^{N,Lip}_s‖_{Lip(M_•)} ≤ √d + C τ e^{7/δ} (1 + C_{µ∞,M})^{1/2} (1 + ρ_max)^{1/2} d^{1/4} n L^{3/2} Σ_{i=0}^{s−1} ‖ζ^N_i‖_{L²_{µ_N}(M)}.  (B.46)

Let us briefly pause to reorient ourselves.
We do not have control of the empirical losses appearing in (B.46) by an outside result, so we need to make some further simplifications to this bound. We will control the sum of empirical losses in (B.46) by linking it to the difference population error, which we last saw in (B.38), and the population error, using the triangle inequality and a change of measure inequality. Meanwhile, with the Lipschitz property of ζ^{N,Lip}_s we have shown, we will be able to obtain a bound in terms of simpler quantities for the last term on the RHS of (B.38) using (B.32). The two resulting bounds will give us a system of two coupled 'discrete integral equations' for the difference population error and the Lipschitz constants of ζ^{N,Lip}_s, which we will solve inductively.

First, we continue simplifying (B.46). The triangle inequality and the fact that µ_N is a probability measure give

  ‖ζ^N_i‖_{L²_{µ_N}(M)} ≤ ‖ζ^{N,Lip}_i‖_{L²_{µ_N}(M)} + ‖δ^N_i‖_{L∞(M)},  (B.47)

and we have by the triangle inequality and Hölder-1/2 continuity of x ↦ √x

  ‖ζ^{N,Lip}_i‖_{L²_{µ_N}(M)} ≤ ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} + | ‖ζ^{N,Lip}_i‖_{L²_{µ_N}(M)} − ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} |
   ≤ ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} + | ∫_M ( ζ^{N,Lip}_i(x) )² ( dµ∞(x) − dµ_N(x) ) |^{1/2}.  (B.48)

We have shown that ζ^{N,Lip}_i ∈ Lip(M) and ζ^{N,Lip}_i ∈ L∞(M) above, and so (ζ^{N,Lip}_i)² ∈ Lip(M) as well, with

  ‖(ζ^{N,Lip}_i)²‖_{Lip(M_•)} ≤ 2 ‖ζ^{N,Lip}_i‖_{L∞(M)} ‖ζ^{N,Lip}_i‖_{Lip(M_•)}.

Applying the previous equation with (B.32) to control (B.48), we get

  ‖ζ^{N,Lip}_i‖_{L²_{µ_N}(M)} ≤ ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} + C d^{1/4} ( ‖ζ^{N,Lip}_i‖²_{L∞(M)} / N + e^{14/δ} C_{µ∞,M} ‖ζ^{N,Lip}_i‖_{L∞(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_i‖_{Lip(M_•)} / N^{1/(2+δ)} )^{1/2}
   ≤ ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} + C d^{1/4} ( ‖ζ^{N,Lip}_i‖_{L∞(M)} / √N + e^{7/δ} C_{µ∞,M}^{1/2} ‖ζ^{N,Lip}_i‖^{1/2}_{L∞(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_i‖^{1/2}_{Lip(M_•)} / N^{1/(4+2δ)} ),

where the second line applies the Minkowski inequality. Using the triangle inequality and that µ∞ is a probability measure, we have

  ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} ≤ ‖ζ^N_i‖_{L²_{µ∞}(M)} + ‖δ^N_i‖_{L²_{µ∞}(M)} ≤ ‖ζ^∞_i‖_{L²_{µ∞}(M)} + ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)} + ‖δ^N_i‖_{L∞(M)}.
(B.49) Substituting (B.49) into (B.47) and using (B.37) to simplify gives

  ‖ζ^N_i‖_{L²_{µ_N}(M)} ≤ ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)} + ‖ζ^∞_i‖_{L²_{µ∞}(M)} + 2iτ √d ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12}
   + C d^{1/4} ( ‖ζ^{N,Lip}_i‖_{L∞(M)} / √N + e^{7/δ} C_{µ∞,M}^{1/2} ‖ζ^{N,Lip}_i‖^{1/2}_{L∞(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_i‖^{1/2}_{Lip(M_•)} / N^{1/(4+2δ)} ).  (B.50)

Following (B.46), we need to sum the previous bound over i. To simplify residuals, we use (B.28) to get

  C s² τ √d ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} + Σ_{i=0}^{s−1} ‖ζ^∞_i‖_{L²_{µ∞}(M)} ≤ C s² τ √d ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} + C_ρ^{2q_cert} C′ d log² L / (nτ) ≤ 2 C_ρ^{2q_cert} C′ d log² L / (nτ),

where the second bound uses the control sτ ≤ kτ ≤ L^q/n and holds under the condition n ≥ (C/C′)^{12} L^{48+32q} d³. Summing in (B.50) and using the previous bound, it follows that

  Σ_{i=0}^{s−1} ‖ζ^N_i‖_{L²_{µ_N}(M)} ≤ C C_ρ^{2q_cert} d log² L / (nτ) + Σ_{i=0}^{s−1} [ ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)} + C d^{1/4} ( ‖ζ^{N,Lip}_i‖_{L∞(M)} / √N + e^{7/δ} C_{µ∞,M}^{1/2} ‖ζ^{N,Lip}_i‖^{1/2}_{L∞(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_i‖^{1/2}_{Lip(M_•)} / N^{1/(4+2δ)} ) ].  (B.51)

Plugging (B.51) into (B.46), we obtain

  ‖ζ^{N,Lip}_s‖_{Lip(M_•)} ≤ C₁ d^{1/4} L^{3/2} [ d log² L + nτ Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)}
   + nτ d^{1/4} Σ_{i=0}^{s−1} ( ‖ζ^{N,Lip}_i‖_{L∞(M)} / √N + ‖ζ^{N,Lip}_i‖^{1/2}_{L∞(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_i‖^{1/2}_{Lip(M_•)} / N^{1/(4+2δ)} ) ],  (B.52)

where for concision we have defined

  C₁(δ, µ∞) = C C_ρ^{2q_cert} e^{14/δ} (1 + C_{µ∞,M}) (1 + ρ_max)^{1/2}.  (B.53)

Next, we turn to the remaining term on the RHS of (B.38). For each x ∈ M, we have ψ₁∘∠(x, ·) ζ^{N,Lip}_s ∈ Lip(M) as well, with

  ‖ψ₁∘∠(x, ·) ζ^{N,Lip}_s‖_{Lip(M_•)} ≤ CnL max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖_{Lip(M_•)} + C′ n L² ‖ζ^{N,Lip}_s‖_{L∞(M)}  (B.54)

using the definition of ψ₁ and Lemmas E.5, C.7 and C.22, and

  ‖ψ₁∘∠(x, ·) ζ^{N,Lip}_s‖_{L∞(M)} ≤ CnL ‖ζ^{N,Lip}_s‖_{L∞(M)}.  (B.55)

Applying (B.32) with (B.54) and (B.55), we obtain

  ‖ ∫_M ψ₁∘∠(·, x′) ζ^{N,Lip}_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)} ≤ CnL √d ‖ζ^{N,Lip}_s‖_{L∞(M)} / N + CnL e^{14/δ} C_{µ∞,M} √d max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖_{Lip(M_•)} / N^{1/(2+δ)} + CnL² e^{14/δ} C_{µ∞,M} √d ‖ζ^{N,Lip}_s‖_{L∞(M)} / N^{1/(2+δ)},

and we can combine the first and third terms on the RHS of the previous bound by worst-casing, giving

  ‖ ∫_M ψ₁∘∠(·, x′) ζ^{N,Lip}_s(x′) ( dµ∞(x′) − dµ_N(x′) ) ‖_{L²_{µ∞}(M)} ≤ ( C√d nL e^{14/δ} (1 + C_{µ∞,M}) / N^{1/(2+δ)} ) ( max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖_{Lip(M_•)} + L ‖ζ^{N,Lip}_s‖_{L∞(M)} ).

Plugging the previous bound into (B.38), we obtain

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C ( L^{60+32q} d^{15} log⁹ L / n )^{1/12} + τ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + ( Cτ √d nL e^{14/δ} (1 + C_{µ∞,M}) / N^{1/(2+δ)} ) Σ_{s=0}^{k−1} ( max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖_{Lip(M_•)} + L ‖ζ^{N,Lip}_s‖_{L∞(M)} ).  (B.56)

To finish coupling (B.52) and (B.56), we need to remove the L∞(M) terms. We accomplish this using Lemma B.14, which gives

  ‖ζ^{N,Lip}_s‖_{L∞(M)} ≤ C C₂^{1/2} ‖ζ^{N,Lip}_s‖_{L²_{µ∞}(M)} + ( C′ / ρ_min^{1/3} ) ‖ζ^{N,Lip}_s‖^{2/3}_{L²_{µ∞}(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖^{1/3}_{Lip(M_•)},  (B.57)

where we have defined

  C₂(µ∞) = ρ_max / ( ρ_min min{µ∞(M₊), µ∞(M₋)} ).  (B.58)

For coupling purposes, it will suffice to use a version of (B.57) obtained by simplifying with some coarse estimates. Using (B.49), (B.37) and (B.29), we have

  ‖ζ^{N,Lip}_i‖_{L²_{µ∞}(M)} ≤ √d + iτ √d ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} + ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)} ≤ 2√d + ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)},

using iτ ≤ L^q/n and n ≥ L^{48+20q} d⁹ log⁹ L in the second inequality, and plugging this into (B.57) and using the Minkowski inequality gives

  ‖ζ^{N,Lip}_s‖_{L∞(M)} ≤ C C₂^{1/2} √d + C C₂^{1/2} ‖ζ^N_s − ζ^∞_s‖_{L²_{µ∞}(M)} + ( C d^{1/3} / ρ_min^{1/3} ) max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖^{1/3}_{Lip(M_•)} + ( C′ / ρ_min^{1/3} ) ‖ζ^N_s − ζ^∞_s‖^{2/3}_{L²_{µ∞}(M)} max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖^{1/3}_{Lip(M_•)}.  (B.59)

To make some of the subsequent bounds more concise, we introduce the additional notation

  Λ_s = max_{•∈{+,−}} ‖ζ^{N,Lip}_s‖_{Lip(M_•)}.
Plugging (B.59) into (B.52) and using the Minkowski inequality, we obtain

  Λ_s ≤ C C₁ d^{1/4} L^{3/2} [ d log² L + C₂^{1/2} d^{3/4} n s τ / √N + nτ ( 1 + C₂^{1/2} d^{1/4} / √N ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)}
   + ( nτ d^{7/12} / (ρ_min^{1/3} √N) ) Σ_{i=0}^{s−1} Λ_i^{1/3} + ( C₂^{1/4} nτ d^{1/2} / N^{1/(4+2δ)} ) Σ_{i=0}^{s−1} Λ_i^{1/2} + ( nτ d^{5/12} / (ρ_min^{1/6} N^{1/(4+2δ)}) ) Σ_{i=0}^{s−1} Λ_i^{2/3}
   + ( nτ d^{1/4} / (ρ_min^{1/3} √N) ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{2/3}_{L²_{µ∞}(M)} Λ_i^{1/3} + ( C₂^{1/4} nτ d^{1/4} / N^{1/(4+2δ)} ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{1/2}_{L²_{µ∞}(M)} Λ_i^{1/2}
   + ( nτ d^{1/4} / (ρ_min^{1/6} N^{1/(4+2δ)}) ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{1/3}_{L²_{µ∞}(M)} Λ_i^{2/3} ].  (B.60)

To simplify (B.60), we use sτ ≤ L^q/n, C₂ ≥ 1, and q ≤ 1, and so if additionally we choose N ≥ C₂ max{√d, L²} we obtain

  Λ_s ≤ C C₁ d^{1/4} L^{3/2} [ d log² L + nτ Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖_{L²_{µ∞}(M)} + ( nτ d^{1/4} / (ρ_min^{1/6} N^{1/(4+2δ)}) ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{1/3}_{L²_{µ∞}(M)} Λ_i^{2/3}
   + ( nτ d^{7/12} / (ρ_min^{1/3} √N) ) Σ_{i=0}^{s−1} Λ_i^{1/3} + ( C₂^{1/4} nτ d^{1/2} / N^{1/(4+2δ)} ) Σ_{i=0}^{s−1} Λ_i^{1/2} + ( nτ d^{5/12} / (ρ_min^{1/6} N^{1/(4+2δ)}) ) Σ_{i=0}^{s−1} Λ_i^{2/3}
   + ( nτ d^{1/4} / (ρ_min^{1/3} √N) ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{2/3}_{L²_{µ∞}(M)} Λ_i^{1/3} + ( C₂^{1/4} nτ d^{1/4} / N^{1/(4+2δ)} ) Σ_{i=0}^{s−1} ‖ζ^N_i − ζ^∞_i‖^{1/2}_{L²_{µ∞}(M)} Λ_i^{1/2} ].  (B.61)

Meanwhile, we recall

  C_{µ∞,M} = len(M₊)/µ∞(M₊) + len(M₋)/µ∞(M₋),

and an integration in coordinates gives

  µ∞(M_±) = ∫₀^{len(M_±)} ρ_± ∘ γ_±(t) dt ≥ ρ_min len(M_±),  so that  C_{µ∞,M} ≤ 2/ρ_min.  (B.62)

Using this bound in (B.56) together with (B.57), we obtain

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C ( L^{60+32q} d^{15} log⁹ L / n )^{1/12}
   + ( τ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} + C′ C₂^{1/2} τ √d n L² e^{14/δ} / (ρ_min N^{1/(2+δ)}) ) Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + ( C″ τ √d n L e^{14/δ} / (ρ_min N^{1/(2+δ)}) ) Σ_{s=0}^{k−1} [ Λ_s + L C₂^{1/2} √d + L d^{1/3} ρ_min^{−1/3} Λ_s^{1/3} + L ρ_min^{−1/3} ‖ζ^∞_s − ζ^N_s‖^{2/3}_{L²_{µ∞}(M)} Λ_s^{1/3} ].  (B.63)

In (B.61) and (B.63), we now have a suitable system of coupled discrete integral equations for ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} and Λ_k. We will solve these equations by positing bounds for each quantity that are valid for all indices 0 ≤ k ≤ L^q/(nτ) based on inspection of (B.61) and (B.63), then proving the bounds hold by induction on k.
Positing the bounds is not too hard, because each term in (B.61) and (B.63) with a factor of N in its denominator can be forced to be small by requiring N to be large enough. For all 0 ≤ k ≤ L^q/(nτ), we claim

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C_diff max{ C_lip C₁ C₂^{1/2} C_ρ^{4/3} d^{7/4} L^{5/2+q} log² L / N^{1/(2+δ)}, ( L^{60+32q} d^{15} log⁹ L / n )^{1/12} }  (B.64)

  Λ_k ≤ C_lip C₁ d^{5/4} L^{3/2} log² L,  (B.65)

where C_diff and C_lip are two absolute constants that we will specify in our arguments below. We prove (B.64) and (B.65) by induction on k. The case of k = 0 is immediate, since ζ^∞_0 = ζ^N_0 for (B.64), and (B.46) at s = 0 together with (B.31) gives (B.65). For the inductive step, suppose that (B.64) and (B.65) hold for all indices smaller than k. Using the inductive hypothesis (B.65), we simplify the last sum in (B.63): for each s < k,

  Λ_s + L C₂^{1/2} √d + L ( d Λ_s / ρ_min )^{1/3} ≤ C_lip C₁ d^{5/4} L^{3/2} log² L + L C₂^{1/2} √d + ( C_lip C₁ / ρ_min )^{1/3} d^{3/4} L^{3/2} log^{2/3} L ≤ C_lip C₁ C₂^{1/2} C_ρ^{1/3} d^{5/4} L^{3/2} log² L,

where we worst-cased in the second inequality using C_lip ≥ 1 and C₁ ≥ 1, C₂ ≥ 1, which follow from (B.53) and (B.58). We use kτ ≤ L^q/n with the last bound to note that

  ( C″ τ √d n L e^{14/δ} / (ρ_min N^{1/(2+δ)}) ) Σ_{s=0}^{k−1} C_lip C₁ C₂^{1/2} C_ρ^{1/3} d^{5/4} L^{3/2} log² L ≤ C‴ C_lip C₁ C₂^{1/2} C_ρ^{4/3} d^{7/4} L^{5/2+q} log² L / N^{1/(2+δ)},

where C‴ ≥ 1. Using this bound and (B.65) once more, we can simplify (B.63) to

  ‖ζ^∞_k − ζ^N_k‖_{L²_{µ∞}(M)} ≤ C‴ C_lip C₁ C₂^{1/2} C_ρ^{4/3} d^{7/4} L^{5/2+q} log² L / N^{1/(2+δ)} + C ( L^{60+32q} d^{15} log⁹ L / n )^{1/12}
   + ( τ ( n^{11} L^{48+8q} d⁹ log⁹ L )^{1/12} + C′ C₂^{1/2} τ √d n L² e^{14/δ} / (ρ_min N^{1/(2+δ)}) ) Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖_{L²_{µ∞}(M)}
   + ( C″ C_lip^{1/3} C₁^{1/3} e^{14/δ} τ d^{11/12} n L^{5/2} log^{2/3} L / (ρ_min^{4/3} N^{1/(2+δ)}) ) Σ_{s=0}^{k−1} ‖ζ^∞_s − ζ^N_s‖^{2/3}_{L²_{µ∞}(M)}.  (B.66)

Noticing that the RHS of the bound (B.64) does not depend on k, let us momentarily denote it by C_diff M (i.e., the part of the RHS of this bound that does not involve C_diff is denoted as M).
Plugging into (B.66) and using kτ ≤ L q /n, we obtain ζ ∞ k -ζ N k L 2 µ ∞ (M) ≤ C C lip C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/(2+δ) + C L 60+32q d 15 log 9 L n 1/12 + C diff L 48+20q d 9 log 9 L n 1/12 + C C 1/2 2 √ dL 2+q e 14/δ ρ min N 1/(2+δ) M + C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 . In particular, if C diff = 6 max{C, C } (for the constants in the first line of the previous bound), we can bound the RHS of the previous bound and obtain ζ ∞ k -ζ N k L 2 µ ∞ (M) ≤ C diff M 3 + C diff L 48+20q d 9 log 9 L n 1/12 + C C 1/2 2 √ dL 2+q e 14/δ ρ min N 1/(2+δ) M + C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 . (B.67) We can conclude (B.64) from (B.67) provided we can show the second and third terms are no larger than C diff M/3. For the second term in (B.67), if we choose N such that N 1/(2+δ) ≥ 6C C 1/2 2 ρ -1 min e 14/δ d 1/2 L 2+q and n such that n ≥ 6 12 L 48+20q d 9 log 9 L then we have C diff L 48+20q d 9 log 9 L n 1/12 + C C 1/2 2 √ dL 2+q e 14/δ ρ min N 1/(2+δ) M ≤ C diff M 3 . For the third term in (B.67), we proceed in cases: first, when C lip C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/(2+δ) ≤ L 60+32q d 15 log 9 L n 1/12 , (B.68) we have by (B.64) M = L 60+32q d 15 log 9 L n 1/12 , and if we require additionally  C diff ≥ 1, it follows that C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 ≤ C C diff C lip C 1 C 1/2 2 C 4/3 ρ e 14/δ d 7/4 L 5/2+q log 2 L N 1/(2+δ) M 2/3 ≤ C C diff e 14/δ M 1+2/3 , using C 1 ≥ 1, C 2 ≥ 1, C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 ≤ C diff M 3 , as desired. Next, we consider the remaining case C lip C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/(2+δ) ≥ L 60+32q d 15 log 9 L n 1/12 , (B.69) which by (B.64) implies 2+δ) . 
M = C lip C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/( With this setting of M , the third term in (B.67) can be bounded as C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 = C C 2/3 diff C lip C 1 C 1/3 2 C 4/3+8/9 ρ e 14/δ d 7/4+1/3 L 5/2+q L 5/3+2q/3 log 2 L N 1/(2+δ)+2/(6+3δ) ≤ C C diff C 8/9 ρ e 14/δ d 1/3 L 5/3+2q/3 N 2/(6+3δ) M, and using the RHS of the final bound in the previous expression, we see that if we choose N 1/(2+δ) ≥ (3C ) 3/2 C 4/3 ρ e 21/δ d 1/2 L 5/2+q , then we have for the case (B.69) C C 2/3 diff C 1/3 lip C 1/3 1 e 14/δ ρ 4/3 min d 11/12 L 5/2+q log 2/3 L N 1/(2+δ) M 2/3 ≤ C diff M 3 . Combining the bounds on the third term in (B.67) over both cases (B.68) and (B.69), we have shown which proves (B.64) . Next, to verify (B.65), we proceed with a similar idea: the bound claimed in (B.65) corresponds to a constant multiple of the first term in parentheses in (B.61), so to establish (B.65) it suffices to show that each of the other terms in (B.61) is no larger than a certain constant. To work with the maximum operation in (B.64), we will again split the analysis into two cases. First, we consider the case where (B.69) holds, so that the maximum in (B.64) is achieved by the second argument. Plugging (B.64) and (B.65) into (B.61) and using kτ ≤ L q /n, we get 2+δ) .  and C 2 ≥ 1, we can worst-case constants in the previous expression to simplify. 
We can then do some selective worst-casing of the exponents on d, N , and L in all except the first term: we have evidently (to combine the first and last terms) ζ ∞ k -ζ N k L 2 µ ∞ (M) ≤ C diff M, Λ k ≤ CC 1 d 1/4 L 3/2 d log 2 L + C diff C lip C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+2q log 2 L N 1/(2+δ) + C 1/3 diff C lip C 11/18 ρ C 1 C 1/6 2 d 5/3 L 11/6+4q/3 log 2 L N 6/(10+5δ) + C 1/3 lip C 1/3 1 C 1/3 ρ dL 1/2+q log 2/3 L √ N + C 1/2 lip C 1/2 1 C 1/4 2 d 9/8 L 3/4+q log L N 1/(4+2δ) + C 2/3 lip C 2/3 1 C 1/6 ρ d 5/4 L 1+q log 4/3 L N 1/(4+2δ) + C 2/3 diff C lip C 1 C 1/3 2 C 5/3 ρ d 11/6 L 13/6+5q/3 log 2 L N (5+δ)/(4+2δ) + C 1/2 diff C lip C 1 C 1/2 2 C 2/3 ρ d 7/4 L 2+3q/2 log 2 L N 1/( Using C lip ≥ 1, C 1 ≥ 1, C ρ ≥ 1, d 7/4 L 2+3q/2 log 2 L N 1/(2+δ) ≤ d 7/4 L 5/2+2q log 2 L N 1/(2+δ) and (to combine the first and second terms) 2+δ) , and because 0 < δ ≤ 1, we have 1/(2 + δ) ≤ 1/2 and (5 + δ)/(4 + 2δ) ≥ 1, and if N ≥ d 1/12 this implies (to combine the first and second-to-last terms) 2+δ) . d 5/3 L 11/6+4q/3 log 2 L N 6/(10+5δ) ≤ d 7/4 L 5/2+2q log 2 L N 1/( d 11/6 L 13/6+5q/3 log 2 L N (5+δ)/(4+2δ) ≤ d 7/4 L 5/2+2q log 2 L N 1/( We can worst-case the remaining three terms, and we thus obtain Λ k ≤ CC 1 d 1/4 L 3/2 d log 2 L + 4C diff C lip C 1 C 1/2 2 C 5/3 ρ d 1+3/4 L 5/2+2q log 2 L N 1/(2+δ) +3C 2/3 lip C 2/3 1 C 1/4 2 C 1/3 ρ d 1+1/4 L 1+q log 4/3 L N 1/(4+2δ) . We can then pick C lip = 3C, and if N 1/(4+2δ) ≥ 3(3C) 2/3 C 2/3 1 C 1/4 2 C 1/3 ρ d 1/4 L 1+q , and N 1/(2+δ) ≥ 12CC diff C 1 C 1/2 2 C 5/3 ρ d 3/4 L 5/2+2q , then it follows from the previous bound Λ k ≤ 3CC 1 d 5/4 L 3/2 log 2 L, which establishes (B.65) in the first case, where (B.69) holds. Next, we consider the remaining case where (B.68) holds, so that the maximum in (B.64) is saturated by the first argument. 
We start by grouping some terms in (B.61) so that it will be slightly easier to simplify later: we can write Λ k ≤ CC 1 d 1/4 L 3/2 d log 2 L + nτ k-1 s=0 ζ N s -ζ ∞ s L 2 µ ∞ (M) + nτ d 1/4 ρ 1/6 min N 1/(4+2δ) k-1 s=0 ζ N s -ζ ∞ s 1/3 L 2 µ ∞ (M) + d 1/6 Λ 2/3 s + nτ d 1/4 ρ 1/3 min √ N k-1 s=0 ζ N s -ζ ∞ s 2/3 L 2 µ ∞ (M) + d 1/3 Λ 1/3 s + C 1/4 2 nτ d 1/4 N 1/(4+2δ) k-1 s=0 ζ N s -ζ ∞ s 1/2 L 2 µ ∞ (M) + d 1/4 Λ 1/2 s . (B.70) By the case-defining condition (B.68) and (B.64), enforcing n ≥ C 12 diff L 60+32q d 9 log 9 L implies ζ N s -ζ ∞ s L 2 µ ∞ (M) + d 1/2 ≤ 2d 1/2 , and we can use this to simplify (B.70), obtaining Λ k ≤ CC 1 d 1/4 L 3/2 d log 2 L + nτ k-1 s=0 ζ N s -ζ ∞ s L 2 µ ∞ (M) + 2nτ d 5/12 ρ 1/6 min N 1/(4+2δ) k-1 s=0 Λ 2/3 s + 2nτ d 7/12 ρ 1/3 min √ N k-1 s=0 Λ 1/3 s + 2C 1/4 2 nτ d 1/2 N 1/(4+2δ) k-1 s=0 Λ 1/2 s . (B.71) Plugging (B.64) and (B.65) into (B.71) and using kτ ≤ L q /n and (B.68), we get the bound  Λ k ≤ CC 1 d 1/4 L 3/2 d log 2 L + C diff L 60+44q d 15 log 9 L n 1/12 + 2C 2/3 lip C 2/3 1 C 1/6 ρ d 1+1/4 L 1+q log 2/3 L N 1/(4+2δ) + 2C 1/3 lip C 1/3 1 C 1/3 ρ dL 1/2+q log 2/3 L √ N + 2C 1/2 lip C 1/2 1 C 1/4 2 d 1+1/8 L 3/4+q log L N 1/(4+2δ ≤ k ≤ L q /(nτ ) . We can now wrap up the proof: we will obtain the desired conclusion by plugging the results we have developed into (B.57) and simplifying. 
Plugging (B.37), (B.27) and (B.64) into (B.49) and bounding the maximum by the sum, we get ζ N,Lip L q /(nτ ) L 2 µ ∞ (M) ≤ ζ ∞ L q /(nτ ) L 2 µ ∞ (M) + ζ N L q /(nτ ) -ζ ∞ L q /(nτ ) L 2 µ ∞ (M) + δ N L q /(nτ ) L ∞ (M) ≤ CC q cert ρ √ d log L nτ L q /(nτ ) + C L 60+32q d 15 log 9 L n 1/12 +C C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/(2+δ) ≤ CC qcert ρ √ d log L L q + C L 60+32q d 15 log 9 L n 1/12 + C C 1 C 1/2 2 C 4/3 ρ d 7/4 L 5/2+q log 2 L N 1/(2+δ) ≤ CC qcert ρ √ d log L L q , (B.73) where in the third inequality we apply L q /(nτ ) ≥ L q /(2nτ ), which follows from our choice of step size, and in the fourth inequality we simplify residuals using n ≥ (C /C) 12 d 9 L 60+44q and N 1/(2+δ) ≥ C C 1 C 1/2 2 C 4/3 ρ d 5/4 L 5/2+2q log L. Applying (B.73), the triangle inequality (with (B.37) and the fact that µ ∞ is a probability measure) and our previous choice of large n, we get ζ N L q /(nτ ) L 2 µ ∞ (M) ≤ CC qcert ρ √ d log L L q , (B.74) i.e. generalization in L 2 µ ∞ (M). We can bootstrap generalization in L ∞ (M) from (B.73) using the triangle inequality and (B.57): we get ζ N L q /(nτ ) L ∞ (M) ≤ CC 1/2 2 ζ N,Lip L q /(nτ ) L 2 µ ∞ (M) + C ρ 1/3 min ζ N,Lip L q /(nτ ) 2/3 L 2 µ ∞ (M) Λ 1/3 L q /(nτ ) + δ N L q /(nτ ) L ∞ (M) ≤ CC 1/2 2 C qcert ρ √ d log L L q + C C 1/3 1 ρ 1/3 min d 3/4 L (3-4q)/6 log 4/3 L min ρ 1/3 min , 1 , where in the second line we apply (B.37) and our previous choice of large n to absorb the residual from δ N L q /(nτ ) , and apply (B.65) to bound the Λ 1/3 L q /(nτ ) term. Worst-casing the errors in the previous bound, we obtain ζ N L q /(nτ ) L ∞ (M) ≤ C C qcert ρ C 1/2 2 + C 1/3 1 C 2/3 ρ d 3/4 log 4/3 L L (4q-3)/6 . To conclude, we will tally dependencies and make some simplifications to show the conditions stated in the result suffice. 
Recalling (B.53) and (B.58) and using (B.62), we have C 1 ≤ CC 2qcert+1 ρ (1 + ρ max ) 1/2 e 14/δ , so we can simplify to C 1/2 ρ C 1/2 2 + C 1/3 1 C 2/3 ρ ≤ C ρ ρ max min {µ ∞ (M + ), µ ∞ (M -)} 1/2 + CC 1+2qcert/3 ρ (1 + ρ max ) 1/6 e 14/(3δ) ≤ CC 1+2qcert/3 ρ (1 + ρ max ) 1/2 e 14/(3δ) min {µ ∞ (M + ), µ ∞ (M -)} 1/2 . We can use this to obtain a simplified generalization in L ∞ (M) bound from our previous expression: it becomes ζ N L q /(nτ ) L ∞ (M) ≤ CC 1+2qcert/3 ρ (1 + ρ max ) 1/2 e 14/(3δ) min {µ ∞ (M + ), µ ∞ (M -)} 1/2 d 3/4 log 4/3 L L (4q-3)/6 , (B.75) which can be made nonvacuous when q > 3/4. Tallying dependencies, we find after worst-casing (and using q ≥ 1/2 and some interdependences between parameters to simplify) that it suffices to choose N such that N 1/(2+δ) ≥ CC 4/3 1 C 1/2 2 C 5/3 ρ e 21/δ d 5/4 L 5/2+2q log L, the depth L such that L ≥ C max{C 2qcert ρ d, κ 2 C λ }, the width n such that n ≥ C max e 252/δ L 60+44q d 9 log 9 L, κ 2/5 , κ c λ 1/3 , and d such that d ≥ Cd 0 log(nn 0 C M ). Unpacking the constants in the condition on N , we see that it suffices to choose N such that N 1/(2+δ) ≥ CC 7/2+8qcert/3 ρ (1 + ρ max ) 7/6 e 119/(3δ) min {µ ∞ (M + ), µ ∞ (M -)} 1/2 d 5/4 L 5/2+2q log L.
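The final tally of dependencies above can be packaged as a feasibility check. The helper below is illustrative only: the absolute constant C is a hypothetical placeholder (the proof's constants are not computed explicitly), but it shows the shape of the two dominant sufficient conditions, N^{1/(2+δ)} ≳ d^{5/4} L^{5/2+2q} log L and n ≳ L^{60+44q} d^9 log^9 L.

```python
# Illustrative checker for the sufficient conditions derived above.
# C is a hypothetical placeholder constant, not the one in the proof.
import math

def conditions_hold(n, N, L, d, q=1.0, delta=1.0, C=1.0):
    logL = math.log(L)
    cond_N = N ** (1.0 / (2.0 + delta)) >= C * d ** 1.25 * L ** (2.5 + 2 * q) * logL
    cond_n = n >= C * L ** (60 + 44 * q) * d ** 9 * logL ** 9
    return cond_N and cond_n

# Even with C = 1 and the smallest nontrivial depth, the width requirement is
# astronomically large -- the polynomial dependence on L dominates.
assert conditions_hold(n=10**40, N=10**30, L=2, d=1)
assert not conditions_hold(n=10**6, N=10**6, L=2, d=1)
```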

B.3 AUXILIARY RESULTS


Lemma B.8. Define a kernel

Θ^N_k(x, x′) = ∫_0^1 ⟨∇f_{θ^N_k}(x′), ∇f_{θ^N_k − tτ∇L_{µ^N}(θ^N_k)}(x)⟩ dt

and a corresponding operator on L²_{µ^N}(M) by Θ^N_k[g](x) = ∫_M Θ^N_k(x, x′) g(x′) dµ^N(x′). Then Θ^N_k is bounded, and ζ^N_{k+1} = (Id − τΘ^N_k)[ζ^N_k].

Proof. By the definition of the gradient iteration, we have ζ^N_{k+1} − ζ^N_k = f_{θ^N_k − τ∇L_{µ^N}(θ^N_k)} − f_{θ^N_k}. The total number of trainable parameters in the network is M = n(n(L − 1) + n_0 + 1), and the Euclidean space in which θ lies is isomorphic to R^M. For k ∈ N_0, define paths γ^N_k : [0, 1] → R^M by γ^N_k(t) = θ^N_k − tτ∇L_{µ^N}(θ^N_k), so that ζ^N_{k+1} − ζ^N_k = f_{γ^N_k(1)} − f_{γ^N_k(0)}. We will justify a first-order Taylor representation in integral form based on the previous expression by arguing that for every x ∈ M, t ↦ f_{γ^N_k(t)}(x) is absolutely continuous on [0, 1], by checking the hypotheses of (Cohn, 2013, Theorem 6.3.11). Because γ^N_k is smooth and θ ↦ f_θ(x) is continuous and piecewise smooth, the derivative of t ↦ f_{γ^N_k(t)}(x) exists and equals ⟨−τ∇L_{µ^N}(θ^N_k), ∇f_{γ^N_k(t)}(x)⟩ at all but countably many points t ∈ [0, 1]. We finally need to check integrability of this derivative on [0, 1]. We have by linearity

⟨−τ∇L_{µ^N}(θ^N_k), ∇f_{γ^N_k(t)}(x)⟩ = −τ ∫_M ζ_{θ^N_k}(x′) ⟨∇f_{θ^N_k}(x′), ∇f_{γ^N_k(t)}(x)⟩ dµ^N(x′),   (B.76)

and by definition

⟨∇f_{γ^N_k(t)}(x), ∇f_{θ^N_k}(x′)⟩ = ⟨α^L_{γ^N_k(t)}(x), α^L_{θ^N_k}(x′)⟩ + ∑_{ℓ=0}^{L−1} ⟨α^ℓ_{γ^N_k(t)}(x), α^ℓ_{θ^N_k}(x′)⟩ ⟨β^ℓ_{γ^N_k(t)}(x), β^ℓ_{θ^N_k}(x′)⟩.

By construction of the network, the feature maps (t, x) ↦ α^ℓ_{γ^N_k(t)}(x) are continuous. For the backward feature maps, we can write for any θ_1 = (W^1_1, . . . , W^{L+1}_1
) and any θ 2 = (W 1 2 , . . . , W L+1 2 ) using Cauchy-Schwarz β θ1 (x), β θ2 (x ) ≤ L = +1 W +1 1 W +1 2 , and the RHS of this bound is a continuous function of (θ, x). Because our domain of interest [0, 1] × M is compact, we have from the triangle inequality, the previous bound on the backward feature inner products and the Weierstrass theorem sup t∈[0,1], x∈M ∇f γ N k (t) (x), ∇f θ N k (x ) < +∞, (B.77) so that in particular, we can bound our expression for the derivative of t → f γ N k (t) (x) using the triangle inequality as -τ ∇L µ N (θ N k ), ∇f γ N k (t) (x) ≤ Cτ M ζ θ N k (x ) dµ N (x ) for some constant C > 0. The RHS of the previous bound does not depend on t, so by an application of (Cohn, 2013, Theorem 6.3.11) , it follows that t → f γ N k (t) (x) is absolutely continuous, and we have the representation ζ N k+1 (x) -ζ N k (x) = -τ 1 0 ∇L µ N (θ N k ), ∇f γ N k (t) (x) dt. Using (B.76), we can express this as ζ N k+1 (x) -ζ N k (x) = -τ 1 0 M ζ θ N k (x ) ∇f θ N k (x ), ∇f γ N k (t) (x) dµ N (x ) dt. To conclude, it will be convenient to switch the order of integration appearing in the previous expression. Applying (B.77), we have ζ θ N k (x ) ∇f θ N k (x ), ∇f γ N k (t) (x) ≤ C ζ θ N k (x ) , and the RHS of this bound is integrable over [0, 1]×M because the network is a continuous function of the input. By Fubini's theorem, it follows ζ N k+1 (x) -ζ N k (x) = -τ M 1 0 ∇f θ N k (x ), ∇f γ N k (t) (x) dt ζ θ N k (x ) dµ N (x ) (B.78) Defining Θ N k (x, x ) = 1 0 ∇f θ N k (x ), ∇f γ N k (t) (x) dt and using (B.77), we can define bounded operators Θ N k : L 2 µ N (M) → L 2 µ N (M) by Θ N k [g](x) = M Θ N k (x, x )g(x ) dµ N (x ), and with this definition, (B.78) becomes ζ N k+1 -ζ N k = -τ Θ N k ζ N k , as claimed. Lemma B.9. For any network parameters θ, define kernels Θ θ (x, x ) = ∇f θ (x ), ∇f θ (x) , and for ∈ {N, ∞}, define corresponding bounded operators on L 2 µ (M) by Θ θ,µ [g](x) = M Θ θ (x, x )g(x ) dµ (x ). 
For any settings of the parameters θ, the operators Θ θ,µ are self-adjoint, positive, and compact. In particular, they diagonalize in a countable orthonormal basis of L 2 µ (M) functions with corresponding nonnegative eigenvalues.

Proof. When the index
= N , an identification reduces the operators Θ θ,µ to operators on finitedimensional vector spaces, and the claims follow immediately from general principles and the finite-dimensional spectral theorem. We therefore only work out the details for the case = ∞. Boundedness follows from an argument identical to the one developed in the proof of Lemma B.8, in particular to develop an estimate analogous to (B.77) . This estimate, together with separability and compactness of M, also establishes that Θ θ,∞ is compact, by standard results for Hilbert-Schmidt operators (Heil, 2011, §B) . In addition, this estimate allows us to apply Fubini's theorem to write for any g 1 , g 2 ∈ L 2 µ ∞ (M) g 1 , Θ θ,∞ [g 2 ] L 2 µ ∞ (M) = M×M Θ θ (x, x )g 1 (x)g 2 (x ) dµ ∞ (x) dµ ∞ (x ) = g 2 , Θ θ,∞ [g 1 ] since Θ θ (x, x ) = Θ θ (x , x). A similar calculation establishes positivity: we have for any g ∈ L 2 µ ∞ (M) g, Θ θ,∞ [g] L 2 µ ∞ (M) = M×M ∇f θ (x ), ∇f θ (x) g(x)g(x ) dµ ∞ (x) dµ ∞ (x ) = M ∇f θ (x)g(x) dµ ∞ (x), M ∇f θ (x)g(x) dµ ∞ (x) ≥ 0, where we applied Fubini's theorem and linearity of the integral. These facts and the spectral theorem for self-adjoint compact operators on a Hilbert space imply in particular that the operator Θ θ,∞ can be diagonalized in a countable orthonormal basis of eigenfunctions (v i ) i∈N ⊂ L 2 µ ∞ (M) with corresponding nonnegative eigenvalues (λ i ) i∈N ⊂ [0, +∞). Lemma B.10. Write Θ µ N for the operator defined in Lemma B.9, with the parameters θ set to the initial random network weights and the measure set to µ N . There exist absolute constants c, K, K > 0 such that for any q ≥ 0 and any d ≥ Kd 0 log(nn 0 C M ), if τ < 1 Θ µ N L 2 µ N (M)→L 2 µ N (M) and if in addition n ≥ K L 48+20q d 9 log 9 L, then one has P   0≤k≤L q /(nτ ) ζ N i L 2 µ N (M) ≤ √ d   ≥ 1 -1 + 2L q nτ e -cd . Proof. 
Consider the nominal error evolution ζ N,nom k , defined iteratively as ζ N,nom k+1 = ζ N,nom k -τ Θ µ N ζ N,nom k ; ζ N,nom 0 = ζ for a step size τ > 0, which satisfies τ < 1 Θ µ N L 2 µ N (M)→L 2 µ N (M) . We will prove the claim by showing that this auxiliary iteration is monotone decreasing in the loss, and close enough to the gradient-like iteration of interest that we can prove that the gradient-like iteration also retains a controlled loss. These dynamics satisfy the 'update equation' ζ N,nom k = Id -τ Θ µ N k [ζ] . Because M is compact and ζ is a continuous function of the input, we have ζ ∈ L ∞ (M) for all values of the random weights. Because µ N is a probability measure, this means ζ has finite L p µ N (M) norm for every p > 0. Meanwhile, the choice of τ and positivity of the operator (by Lemma B.9) guarantees Id -τ Θ µ N L 2 µ N (M)→L 2 µ N (M) ≤ 1, from which it follows from the update equation ζ N,nom k L 2 µ N (M) ≤ ζ L 2 µ N (M) ≤ ζ L ∞ (M) , (B.79) where the last inequality uses that µ N is a probability measure. In particular, this nominal error evolution is nonincreasing in the relevant loss. Now, we recall the update equation for the finitesample dynamics ζ N k+1 = Id -τ Θ N ζ N k , which follows from Lemma B.8. Subtracting and rearranging, this gives an update equation for the difference: ζ N k+1 -ζ N,nom k+1 = Id -τ Θ µ N ζ N k -ζ N,nom k -τ Θ N k -Θ µ N ζ N k . (B.80) Under our hypothesis on τ , (B.80) and the triangle inequality imply the bound M) . Using Jensen's inequality and the Schwarz inequality, we have ζ N k+1 -ζ N,nom k+1 L 2 µ N (M) ≤ ζ N k -ζ N,nom k L 2 µ N (M) + τ ζ N k L 2 µ N (M) Θ N k -Θ µ N L 2 µ N (M)→L 2 µ N ( Θ N k -Θ µ N L 2 µ N (M)→L 2 µ N (M) ≤ sup g L 2 µ N (M) ≤1 M Θ N k ( • , x ) -Θ( • , x ) L 2 µ N (M) |g(x )| dµ N (x ) ≤ sup g L 2 µ N (M) ≤1 Θ N k -Θ L ∞ (M×M) g L 1 µ N (M) ≤ Θ N k -Θ L ∞ (M×M) , since µ N is a probability measure. 
Defining ∆ N k = max i∈{0,1,...,k} Θ N i -Θ L ∞ (M×M) , by a telescoping series and the identical initial conditions, we thus obtain ζ N k+1 -ζ N,nom k+1 L 2 µ N (M) ≤ τ ∆ N k k i=0 ζ N i L 2 µ N (M) , and the triangle inequality and (B.79) then yield ζ N k+1 L 2 µ N (M) ≤ ζ L ∞ (M) + τ ∆ N k k i=0 ζ N i L 2 µ N (M) . Using a discrete version of (the standard) Gronwall's inequality, the previous bound implies ζ N k L 2 µ N (M) ≤ ζ L ∞ (M) + ζ L ∞ (M) k-1 i=0 τ ∆ N k-1 exp   k-1 j=i+1 τ ∆ N k-1   ≤ ζ L ∞ (M) 1 + kτ ∆ N k-1 exp kτ ∆ N k-1 . (B.81) To conclude, we will use Lemma F.5 and an inductive argument based on (B.81). Let us first observe that by Lemma D.11 (with a rescaling of d, which worsens the absolute constants), we have P ζ L ∞ (M) ≤ √ d 2 ≥ 1 -e -cd (B.82) as long as n ≥ Kd 4 L and d ≥ K d 0 log(nn 0 C M ). Define events E N k by E N k = ζ N k L 2 µ N (M) > √ d , where d > 0 is sufficiently large to satisfy the conditions on d given above. We are interested in controlling the probability of k i=0 E N k for k ∈ N 0 . We can write P k i=0 E N i = P k-1 i=0 E N i + P E N k ∩ k-1 i=0 E N i c , and unraveling, we obtain P k i=0 E N k = k i=0 P   E N i ∩ i-1 j=0 E N j c   . In words, it is enough to control the sum of the measures of the parts of E N k that are common with the part of the space where none of the past events occurs. First, note that (B.82) implies P E N 0 ≤ e -cd , and so assume i > 0 below. For any q > 0, if kτ ≤ L q /n, n ≥ KL 36+8q d 9 and d ≥ K d 0 log(nn 0 C M ), Lemma F.5 gives that there are events B N i that respectively contain the sets {∆ N i-1 > CL 4+2q/3 d 3/4 n 11/12 log 3/4 L}, and which satisfy in addition P   B N i ∩ i-1 j=0 E N j c   ≤ e -cd . We thus have by this last bound, a partition, and intersection monotonicity P   E N i ∩ i-1 j=0 E N j c   ≤ e -cd + P E N i ∩ B N i c , and by construction, one has and (B.82) give ∆ N i-1 ≤ CL 4+2q/3 d 3/4 n 11/12 log 3/4 L on B N i c . 
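The discrete Gronwall step used to obtain (B.81) can be sanity-checked numerically. The sketch below (with generic constants B and c standing in for ‖ζ‖_{L∞(M)} and τ∆) runs the recursion a_{k+1} ≤ B + c Σ_{i≤k} a_i with equality, the worst case, and verifies the slightly weaker but simpler Gronwall-type bound a_k ≤ B (1 + kc) e^{kc}.

```python
# Worst-case check of a discrete Gronwall inequality of the form used in (B.81):
# if a_0 <= B and a_{k+1} <= B + c * sum_{i<=k} a_i, then a_k <= B*(1+k*c)*exp(k*c).
import math

def gronwall_trajectory(B, c, steps):
    """Run the recursion with equality (the worst case); return the trajectory."""
    a = [B]
    for _ in range(steps):
        a.append(B + c * sum(a))
    return a

B, c, steps = 1.0, 0.01, 200
a = gronwall_trajectory(B, c, steps)
for k, ak in enumerate(a):
    assert ak <= B * (1 + k * c) * math.exp(k * c) + 1e-9
```

For this recursion with equality one can check that a_k = B(1 + c)^k exactly, which is indeed dominated by B(1 + kc)e^{kc}.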
Another partition P E N i ∩ B N i c ≤ e -cd + P E N i ∩ B N i c ∩ ζ L ∞ (M) ≤ √ d 2 . When the two events on the RHS of the last bound are active, we can obtain using (B.81) ζ N k L 2 µ N (M) ≤ √ d 2 1 + kτ CL 4+2q/3 d 3/4 n 11/12 log 3/4 L exp kτ CL 4+2q/3 d 3/4 n 11/12 log 3/4 L . Given that kτ ≤ L q /n, we have kτ CL 4+2q/3 d 3/4 n 11/12 log 3/4 L ≤ C 12 L 48+20q d 9 log 9 L n 1/12 ≤ 1/e, where the last bound holds provided n ≥ KL 48+20q d 9 log 9 L . Thus, on the event B N i c ∩ ζ L ∞ (M) ≤ √ d 2 , we have ζ N k L 2 µ N (M) ≤ √ d, and thus P E N i ∩ B N i c ∩ ζ L ∞ (M) ≤ √ d 2 = 0. By our previous reductions, we conclude P   E N i ∩ i-1 j=0 E N j c   ≤ 2e -cd , and in particular P k i=0 E N k ≤ (2k + 1)e -cd . The claim is then established by taking k as large as L q /(nτ ). Corollary B.11. Write Θ µ N for the operator defined in Lemma B.9, with the parameters θ set to the initial random network weights θ 0 and the measure set to µ N , and define for k ∈ N 0 ∆ N k = max i∈{0,1,...,k} Θ N i -Θ L ∞ (M×M) . There exist absolute constants c, C, C , K, K > 0 such that for any q ≥ 0 and any d ≥ Kd 0 log(nn 0 C M ), if τ < 1 Θ µ N L 2 µ N (M)→L 2 µ N (M) . and if in addition n ≥ K L 48+20q d 9 log 9 L, then one has on an event of probability at least 1 - C (1 + L q /(nτ ))e -cd ∆ N L q /(nτ ) -1 ≤ C log 3/4 (L)d 3/4 L 4+2q/3 n 11/12 . Proof. Use Lemma B.10 to remove the hypothesis about boundedness of the errors from Lemma F.5, then apply this result together with a union bound. Lemma B.12. Write Θ for the operator defined in Lemma B.9, with the parameters θ set to the initial random network weights and the measure set to µ ∞ . Consider the (population) nominal error evolution ζ ∞ k , defined iteratively as ζ ∞ k+1 = ζ ∞ k -τ Θ [ζ ∞ k ] ; ζ ∞ 0 = ζ for a step size τ > 0, which satisfies τ < 1 Θ L 2 µ ∞ (M)→L 2 µ ∞ (M) . 
Then for any g ∈ L 2 µ ∞ (M) and any k satisfying kτ ≥ 3e 2 g L 2 µ ∞ (M) ζ L ∞ (M) , we have ζ ∞ k L 2 µ ∞ (M) ≤ √ 3 Θ[g] -ζ L 2 µ ∞ (M) - 3 g L 2 µ ∞ (M) kτ log 3 2 g L 2 µ ∞ (M) ζ L ∞ (M) kτ . Proof. The dynamics satisfy the 'update equation' ζ ∞ k = (Id -τ Θ) k [ζ] . Because M is compact and ζ is a continuous function of the input, we have ζ ∈ L ∞ (M) for all values of the random weights. Because µ ∞ is a probability measure, this means ζ has finite L p µ ∞ (M) norm for every p > 0. Using the eigendecomposition of Θ as developed in Lemma B.9, we can therefore write ζ = ∞ i=0 v i , ζ L 2 µ ∞ (M) v i in the sense of L 2 µ ∞ (M). Because Θ and Id -τ Θ diagonalize simultaneously, we obtain ζ ∞ k 2 L 2 µ ∞ (M) = ∞ i=1 (1 -τ λ i ) 2k v i , ζ 2 L 2 µ ∞ (M) ≤ ∞ i=1 e -2kτ λi v i , ζ 2 L 2 µ ∞ (M) , where the inequality follows from the elementary estimate 1 -x ≤ e -x for x ≥ 0 and our choice of τ , which guarantees that 1 -τ λ i > 0 for all i ∈ N so that the elementary estimate is valid after squaring. We can split this last sum into two parts: for any λ ∈ R, we have ζ ∞ k 2 L 2 µ ∞ (M) = i : λi≥λ e -2kτ λi v i , ζ 2 L 2 µ ∞ (M) + i : λi<λ e -2kτ λi v i , ζ 2 L 2 µ ∞ (M) . Because Θ is positive, we have further that λ i ≥ 0 for all i, so we can take λ ≥ 0. The first sum consists of large eigenvalues: we use exp(-2kτ λ i ) ≤ exp(-2kτ λ) to preserve their effect, and then upper bound the remainder of the sum by the squared L 2 µ ∞ norm of ζ. The second sum consists of small eigenvalues: we replace exp(-2kτ λ i ) ≤ 1, and then plug in ζ = Θ[g] -(Θ[g] -ζ) and use bilinearity, self-adjointness of Θ, and the triangle inequality to get M) . 
We then square both (nonnegative) sides of the inequality and use Cauchy-Schwarz to replace the squared sum with the sum of squares times a constant, obtaining v i , ζ L 2 µ ∞ (M) ≤ λ v i , g L 2 µ ∞ (M) + v i , Θ[g] -ζ L 2 µ ∞ ( ζ ∞ k 2 L 2 µ ∞ (M) ≤ e -2kτ λ ζ 2 L 2 µ ∞ (M) + 3λ 2 g 2 L 2 µ ∞ (M) + 3 Θ[g] -ζ 2 L 2 µ ∞ (M) after re-adding indices i to the sum to obtain the third residual. We will choose λ ≥ 0 to minimize the sum of the first and second terms. Differentiating and setting to zero gives the critical point equation 2 3 ζ 2 L 2 µ ∞ (M) (kτ ) 2 g 2 L 2 µ ∞ (M) = (2tλ)e 2kτ λ , which can be inverted to give the unique critical point λ = 1 2kτ W   2 3 ζ 2 L 2 µ ∞ (M) (kτ ) 2 g 2 L 2 µ ∞ (M)   , where W is the Lambert W function, defined as the principal branch of the inverse of z → ze z ; we know that this critical point is a minimizer because the function of λ we differentiated diverges as λ → ∞. Plugging this point into the sum of the first two terms gives ζ ∞ k 2 L 2 µ ∞ (M) ≤ 3 Θ[g] -ζ 2 L 2 µ ∞ (M) + g 2 L 2 µ ∞ (M) (2/3)(kτ ) 2   1 + 1 2 W   ζ 2 L 2 µ ∞ (M) (kτ ) 2 (3/2) g 2 L 2 µ ∞ (M)     W   ζ 2 L 2 µ ∞ (M) (kτ ) 2 (3/2) g 2 L 2 µ ∞ (M)   . For x ≥ 0, the function x → W (x) is strictly increasing, as the inverse of y → ye y ; by definition W (e) = 1; and we have the representation W (z) + log W (z) = log z (Corless et al., 1996) , whence W (x) ≤ log x if x ≥ e. Because µ ∞ is a probability measure, we have ζ 2 L 2 µ ∞ (M) ≤ ζ 2 L ∞ , and therefore if kτ ≥ 3e 2 g L 2 µ ∞ (M) ζ L ∞ (M) , we can simplify the previous bound to ζ ∞ k 2 L 2 µ ∞ (M) ≤ 3 Θ[g] -ζ 2 L 2 µ ∞ (M) + 9 g 2 L 2 µ ∞ (M) 4(kτ ) 2 log 2   3 2 g 2 L 2 µ ∞ (M) ζ 2 L ∞ (M) (kτ ) 2   , using also properties of the logarithm. 
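The two facts about the Lambert W function invoked above, W(e) = 1 and W(x) ≤ log x for x ≥ e, can be verified numerically with a simple Newton iteration for the principal branch (the inverse of z ↦ z e^z on [0, ∞)); this is a sketch for checking the claims, not an implementation used in the proof.

```python
# Numerical check of the Lambert W facts used above: W(e) = 1 and
# W(x) <= log(x) for x >= e (principal branch).
import math

def lambert_w(x, iters=60):
    """Principal branch of W on x >= 0 via Newton's method on w*e^w - x = 0."""
    w = math.log(1.0 + x)  # starting point above the root for x > 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (1.0 + w))
    return w

assert abs(lambert_w(math.e) - 1.0) < 1e-9         # W(e) = 1
for x in [math.e, 5.0, 100.0, 1e6]:
    assert lambert_w(x) <= math.log(x) + 1e-9       # W(x) <= log x for x >= e
```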
Taking square roots and using the Minkowski inequality then yields ζ ∞ k L 2 µ ∞ (M) ≤ √ 3 Θ[g] -ζ L 2 µ ∞ (M) - 3 g L 2 µ ∞ (M) kτ log 3 2 g L 2 µ ∞ (M) ζ L ∞ (M) kτ , where we used the previous lower bound on kτ to determine the sign that the absolute value of the logarithm takes. This gives the claim. Lemma B.13 (Kantorovich-Rubinstein Duality). Let Lip(M) denote the class of functions f : M → R such that both f M± are Lipschitz with respect to the Riemannian distances on M ± . For any d ≥ 1, any 0 < δ ≤ 1 and any N ≥ 2 √ d/ min{µ ∞ (M + ), µ ∞ (M -) }, one has that on an event of probability at least 1 -8e -d , simultaneously for all f ∈ Lip(M) M f (x) dµ ∞ (x) - M f (x) dµ N (x) ≤ 2 f L ∞ (M) √ d N + e 14/δ C µ ∞ ,M √ d max ∈{+,-} f M Lip N 1/(2+δ) , where C µ ∞ ,M = len (M + ) µ ∞ (M + ) + len (M -) µ ∞ (M -) . Proof. The proof is an application of the Kantorovich-Rubinstein duality theorem for the 1-Wasserstein distance (Weed & Bach, 2019, eq . ( 1)), which states that for any two Borel probability measures µ, ν on M ± , one has W (µ, ν) = sup f Lip≤1 M± f (x) dµ(x) - M± f (x) dν(x) , where M ± denotes either of M + or M -, and • Lip is the minimal Lipschitz constant with respect to the Riemannian distance on M ± . Therefore for any f : M ± → R Lipschitz, we have M± f (x) dµ(x) - M± f (x) dν(x) ≤ f Lip W (µ, ν) , (B.83) where one checks separately the case where f Lip = 0 to see that this bound holds there as well. To go from (B.83) to the desired conclusion, we need to pass from the measures µ ∞ and µ N , both supported on M, to measures µ ± (with ∈ {N, ∞}), supported on the manifolds M ± (which we will define in detail below); the challenge here is that the number of 'hits' of each manifold M ± that show up in the finite sample measure µ N is a random variable, which requires a small detour to control. Let us define random variables N + , N -by N + = N µ N (M + ) ; N -= N µ N (M -) , so that N ± have support in {0, 1, . . . , N }, and N + + N -= N . 
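Since the number of positive-class samples follows a Binom(N, p) law, the concentration it enjoys (via Hoeffding's inequality, Lemma G.1 in the text) says the empirical class fraction deviates from p by more than √(d/N) with probability at most 2e^{−2d}. A seeded Monte Carlo sanity check:

```python
# Monte Carlo check of the Hoeffding bound used below (Lemma G.1):
#   P( |N_plus/N - p| > sqrt(d/N) ) <= 2*exp(-2d).
import math
import random

random.seed(0)
N, p, d, trials = 1000, 0.3, 2.0, 5000
threshold = (d / N) ** 0.5

failures = 0
for _ in range(trials):
    n_plus = sum(random.random() < p for _ in range(N))
    if abs(n_plus / N - p) > threshold:
        failures += 1

# The empirical failure rate should respect the Hoeffding bound.
assert failures / trials <= 2 * math.exp(-2 * d)
```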
Define in addition p + = µ ∞ (M + ) ; p -= µ ∞ (M -) , which represent the degree of imbalance between the positive and negative classes in the data. By definition of the i.i.d. sample, we have that N + ∼ Binom(N, p + ). Using N + and N -, we can define 'conditional' finite sample measures µ N + and µ N -by µ N + = 1 max{1, N + } i∈[N ] : xi∈M+ δ {xi} ; µ N -= 1 max{1, N -} i∈[N ] : xi∈M- δ {xi} , so that (N + /N )µ N + + (N -/N )µ N -= µ N ,foot_8 and µ N + and µ N -are both probability measures except when N + ∈ {0, N }, in which case exactly one is a probability measure. By the triangle inequality, we have for any continuous f : M → R M f (x) dµ ∞ (x) - M f (x) dµ N (x) ≤ ∈{+,-} p M f (x) dµ ∞ (x) p - N N M f (x) dµ N (x) ≤ ∈{+,-} f L ∞ (M) N N -p + M f (x) dµ ∞ (x) p - M f (x) dµ N (x) . (B.84) By Lemma G.1, we have P N N -p ≤ √ d N ≥ 1 -2e -2d . (B.85) Using that N -N + = N -and 1 -p + = p -, the bound (B.85) implies if N ≥ 2 √ d/ min{p + , p -} P p 2 ≤ N N ≤ 1 -p 2 ≥ 1 -2e -2d . (B.86) Now fix an arbitrary f ∈ Lip(M). For either ∈ {+, -}, we can write M f (x) dµ N (x) = 1 max{1, N } i : xi∈M f (x i ) = 1 max{1, N i=1 1 xi∈M } N i=1 1 xi∈M f (x i ), and since M + and M -are separated by a positive distance ∆ > 0, we have that x i → 1 xi∈M are continuous functions on M. Since f is continuous on M by the same reasoning and the fact that M is compact, it follows that the functions (x 1 , . . . , x N ) → M f (x) dµ N (x) are continuous on M × • • • × M as well, and in particular for any t > 0 the sets M f (x) dµ ∞ (x) p - M f (x) dµ N (x) > t are open in M, and so is their union over all f ∈ Lip(M). By conditioning, we can then apply (B.86) to write P   f ∈Lip(M) M f (x) dµ ∞ (x) p - M f (x) dµ N (x) > t   ≤ 2e -2d + N (1-p )/2 k= N p /2 P   f ∈Lip(M) M f (x) dµ ∞ (x) p -M f (x) dµ N (x) > t N = k   P[N = k]. 
(B.87) Conditioned on {N = k} with 0 < k < N , the measure µ N is distributed as an empirical measure of sample size k from the probability measure µ ∞ /p supported on M . For N p /2 ≤ k ≤ N (1 -p )/2 , any δ > 0 and any d ≥ 1 we have for both possible values of √ de 14/δ len(M ) k 1/(2+δ) ≤ √ de 14/δ len(M ) N p 2 1/(2+δ) ≤ √ 2de 14/δ len(M ) (N p ) 1/(2+δ) , and so an application of Lemma B.16 thus gives for any 0 < δ ≤ 1 and any d ≥ 2 P W dµ ∞ (x) p , dµ N > √ de 14/δ len(M ) (N p ) 1/(2+δ) N = k ≤ e -d . Combining this last bound with (B.83) and (B.87) gives P    f ∈Lip(M)      M f (x) dµ ∞ (x) p -M f (x) dµ N (x) > √ de 14/δ f M Lip N 1/(2+δ) len(M ) p         ≤ 3e -d . where we used max{p + , p -} ≤ 1 to remove the exponent of 1/(2 + δ) on these terms. Taking a max over the Lipschitz constants and combining this bound with (B.84) and (B.85) and a union bound, we obtain P      f ∈Lip(M)          M f (x) dµ ∞ (x) -M f (x) dµ N (x) > 2 f L ∞ (M) √ d N + e 14/δ C µ ∞ ,M √ d max ∈{+,-} f M Lip N 1/(2+δ)               ≤ 8e -d , where the constant is defined as in the statement of the lemma. Lemma B.14. Let d 0 = 1. There is an absolute constant C > 0 such that for any function f : M → R with f M± Lipschitz with respect to the Riemannian distances on M ± , one has f L ∞ ≤ C max          ρ 1/2 max f L 2 µ ∞ (M) ρ 1/2 min (min {µ ∞ (M+),µ ∞ (M-)}) 1/2 , f 2/3 L 2 µ ∞ (M) max ∈{+,-} f M 1/3 Lip ρ 1/3 min          . Proof. For any T > 0 and a nonconstant Lipschitz function f : [0, T ] → R, we will establish the inequality f L ∞ ≤ C max f L 2 √ T , f 2/3 L 2 f 1/3 Lip , (B.88) where the constant C > 0 is absolute. We can use this result to establish the claim. We start by writing f L ∞ = max ∈{+,-} f M L ∞ , and for ∈ {+, -}, we have f M L ∞ = f • γ L ∞ , (B.89) where γ : [0, len(M )] → M are the smooth unit-speed curves parameterized with respect to arc length parameterizing the manifolds. 
Similarly, the curves' parameterization with respect to arc length implies  f • γ Lip ≤ f M Lip . (B. f M L ∞ ≤ C max f • γ L 2 len(M ) , f • γ 2/3 L 2 f M 1/3 Lip . For the first term in the max, we have f • γ 2 L 2 len(M ) = 1 len(M ) len(M ) 0 f • γ (t) 2 dt ≤ len(M ) 0 f • γ (t) 2 ρ • γ (t) ρ • γ (t) - 1 len(M ) ρ • γ (t) dt + len(M ) 0 f • γ (t) 2 ρ • γ (t) dt using the triangle inequality. For the second term in the last bound, we note that len(M ) 0 f • γ (t) 2 ρ • γ (t) dt ≤ len(M+) 0 f • γ + (t) 2 ρ + • γ + (t) dt + len(M-) 0 f • γ -(t) 2 ρ -• γ -(t) dt ≤ f 2 L 2 µ ∞ (M) , (B.91) and for the first term, we have max t∈[0,len(M )] ρ • γ (t) - 1 len(M ) ρ • γ (t) ≤ max t∈[0,len(M )] ρ • γ (t) -ρ •γ (t) µ ∞ (M ) + ρ •γ (t) µ ∞ (M ) - 1 len(M ) ρ • γ (t) ≤ 1 -µ ∞ (M ) µ ∞ (M ) + ρ max µ ∞ (M )ρ min ≤ 2ρ max µ ∞ (M )ρ min , (B.92) where in the first inequality we used the triangle inequality, and for the second we used that ρ •γ integrates to µ ∞ (M ) over [0, len(M )], which implies that there exists at least one t ∈ [0, len(M )] at which ρ • γ (t) ≥ µ ∞ (M )/ len(M ), so that the maximum of the difference in the second term on the RHS of the first inequality is bounded by the maximum of the density term. Thus, by Hölder's inequality and (B.91) and (B.92), we have len(M ) 0 f • γ (t) 2 ρ • γ (t) ρ • γ (t) - 1 len(M ) ρ • γ (t) dt ≤ 3ρ max ρ min min {µ ∞ (M + ), µ ∞ (M -)} f 2 L 2 µ ∞ (M) Similarly, for the second term in the max, we have f • γ 2/3 L 2 = len(M ) 0 f • γ (t) 2 dt 1/3 ≤ len(M+) 0 f • γ + (t) 2 dt + len(M-) 0 f • γ -(t) 2 dt 1/3 ≤ 1 ρ 1/3 min len(M+) 0 f • γ + (t) 2 ρ + • γ + (t) dt + len(M-) 0 f • γ -(t) 2 ρ -• γ -(t) dt 1/3 ≤ f 2/3 L 2 µ ∞ (M) ρ 1/3 min . Thus, we have f M L ∞ ≤ C max        ρ 1/2 max f L 2 µ ∞ (M) ρ 1/2 min (min {µ ∞ (M + ), µ ∞ (M -)}) 1/2 , f 2/3 L 2 µ ∞ (M) f M 1/3 Lip ρ 1/3 min        , and taking a maximum over ∈ {+, -} establishes the claim. 
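The interpolation inequality (B.88), ‖f‖_{L∞} ≤ C max(‖f‖_{L²}/√T, ‖f‖_{L²}^{2/3} ‖f‖_{Lip}^{1/3}), can be sanity-checked numerically. The sketch below discretizes random piecewise-linear functions on [0, T], uses C = √6 (which dominates both constants √6 and 3^{1/3} produced by the proof), and allows a small slack factor for discretization error of the Riemann approximation.

```python
# Numerical check of the interpolation inequality (B.88) with C = sqrt(6),
# on random piecewise-linear functions sampled on a uniform grid over [0, T].
import math
import random

random.seed(1)

def check(values, T):
    """values: samples of f on a uniform grid over [0, T]."""
    m = len(values)
    h = T / (m - 1)
    sup = max(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values) * h)   # Riemann approximation of ||f||_2
    lip = max(abs(values[i + 1] - values[i]) / h for i in range(m - 1))
    rhs = math.sqrt(6) * max(l2 / math.sqrt(T), l2 ** (2 / 3) * lip ** (1 / 3))
    return sup <= rhs * 1.05   # small slack for discretization error

for _ in range(100):
    T = 1.0 + 9.0 * random.random()
    vals = [random.uniform(-1, 1)]
    for _ in range(499):
        vals.append(vals[-1] + random.uniform(-0.1, 0.1))
    assert check(vals, T)
```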
To prove (B.88), consider first the trivial case where f L ∞ = 0: here the LHS and RHS of (B.88) are identical, and the proof is immediate. When f L ∞ > 0, the Weierstrass theorem implies that there exists t ∈ [0, T ] such that |f (t)| = f L ∞ ; we consider the case sign(f (t)) > 0. For any t ∈ [0, T ], we can write by the Lipschitz property f (t ) ≥ f L ∞ -f Lip |t -t |, and the RHS of the previous bound is nonnegative on the intersection of the interval [t - f L ∞ f -1 Lip , t + f L ∞ f -1 Lip ] with the domain [0, T ] (with standard extended-valued arithmetic conventions when f Lip = 0). This gives the bound f 2 L 2 ≥ min t+ f L ∞ f Lip ,T max t- f L ∞ f Lip ,0 ( f L ∞ -f Lip |t -t |) 2 dt = min f L ∞ f Lip ,T -t max - f L ∞ f Lip ,-t ( f L ∞ -f Lip |t |) 2 dt , where the second line follows from the changes of variables t → t + t. The integrand on the RHS of the second line in the previous bound is even-symmetric, and max -f L ∞ f Lip , -t = -min f L ∞ f Lip , t , so we can discard one side of the interval of integration to get min f L ∞ f Lip ,T -t max - f L ∞ f Lip ,-t ( f L ∞ -f Lip |t |) 2 dt (B.93) ≥ min f L ∞ f Lip ,max{t,T -t} 0 ( f L ∞ -f Lip t ) 2 dt . (B.94) We proceed analyzing two distinct cases. First, if f L ∞ ≤ max{t, T -t} f Lip , then we must have f Lip > 0; integrating the RHS of (B.94), we obtain f 2 L 2 ≥ f 3 L ∞ 3 f Lip , or equivalently f L ∞ ≤ 3 1/3 f 2/3 L 2 f 1/3 Lip . (B.95) Next, we consider the case f L ∞ > max{t, T -t} f Lip . We split on two sub-cases: when f Lip = 0, integrating (B.94) gives f 2 L 2 ≥ f 2 L ∞ max{t, T -t} ≥ T f 2 L ∞ 2 , (B.96) where we used max{t, T -t} ≥ T /2. 
When f Lip > 0, integrating (B.94) gives f 2 L 2 ≥ 1 3 f Lip f 3 L ∞ -( f L ∞ -f Lip max{t, T -t}) 3 = max{t, T -t} 3 2 k=0 f 2-k L ∞ ( f L ∞ -f Lip max{t, T -t}) k ≥ T f 2 L ∞ 6 , (B.97) where the second line uses a standard algebraic identity, and the third line uses max{t, T -t} ≥ T /2 together with the definition of the case to get that f L ∞ -f Lip max{t, T -t} > 0 in order to discard all but the k = 0 summand. Combining (B.97) and (B.96), we obtain for this case  f L ∞ ≤ √ 6 f L 2 √ T , (B. f L ∞ ≤ max √ 6 f L 2 √ T , 3 1/3 f 2/3 L 2 f 1/3 Lip , which establishes (B.88). For the case sign(f (t)) < 0, apply the preceding argument to -f to conclude. See (Brezis, 2011, Exercise 8.15) for a sketch of a proof that leads to more general versions of (B.88). Lemma B.15. For any p ∈ N, if C ≥ (4p) 4p , then one has n ≥ C log p n if n ≥ 2 p C log p (2 p C). Proof. We first give a proof for p = 1, then build off this proof for the general case. Consider the function f (x) = cx -log x. We have f (x) = c -1/x, which is nonnegative for every x ≥ 1/c, so in particular f is increasing under this condition. By concavity of the logarithm, we have log x ≤ log(2/c) + (c/2)(x -2/c), whence f (x) ≥ 1 + cx/2 -log(2/c). The RHS of this bound is equal to zero at x = (2/c)(log(2/c) -1), and 2 c log 2 c -1 ≥ 1 c ⇐⇒ c ≤ 2e -3/2 . In particular, we have f (x) ≥ 0 for every x ≥ (2/c) log(2/c). Rearranging this bound, we can assert the desired conclusion that if C ≥ 3, then n ≥ C log n for every n ≥ 2C log 2C. Equivalently, we have for all such n that Cn -1 log n ≤ 1. Next, we consider the case of p > 1. We will show C log p n n ≤ 1 under suitable conditions. Let us consider the choice n = KC log p KC, where K > 0 is a constant we will specify below. Consider the function f (x) = Cx -1 log p x, which satisfies f (x) = C log p-1 (x)(p -log p-1 (x)) x 2 . In particular, f is decreasing as soon as p ≤ log p-1 (x). 
Now, we can calculate f (KC log p KC) = 1 K 1 + p log log KC log KC p , and by our result for the case p = 1, we have for all p ≥ 2 p log log KC log KC ≤ 1 if log KC ≥ 4p log 4p. This condition is satisfied for KC ≥ (4p) 4p , so if we set K = 2 p , we obtain the above conclusion when C ≥ (4p) 4p . Under these conditions, we then get f (KC log p KC) ≤ 1. Similarly, we have log p-1 (KC log KC) ≥ log p-1 ((4p) 4p ) = (4p) p-1 log p-1 (4p), which is larger than p because 4p ≥ e. It follows that f (x) ≤ 1 for every x ≥ KC log KC, which completes the proof. Lemma B.16 (Concentration of Empirical Measure in Wasserstein Distance (Weed & Bach, 2019) ). Let d 0 = 1. For either ∈ {+, -}, let µ be a Borel probability measure on M , and write µ N for the empirical measure corresponding to N i.i.d. samples from µ. Then for any d ≥ 1 and any 0 < δ ≤ 1, one has P W(µ, µ N ) ≤ √ de 14/δ len(M ) N 1/(2+δ) ≥ 1 -e -2d , where the 1-Wasserstein distance is taken with respect to the Riemannian distance. Proof. The proof is a direct application of the results of (Weed & Bach, 2019) on concentration of empirical measures in Wasserstein distance. For the duration of the proof, we will work on the metric space (M , len(M ) -1 dist M ( • )), i.e., the same metric space scaled to have unit diameter; we will then obtain the result in terms of the unscaled metric by the definition of the 1-Wasserstein distance. Because d 0 = 1 and M can be given as a unit-speed curve parameterized with respect to arc length, we have for any Borel S ⊂ [0, 1] and any ε > 0 N ε (S) ≤ 1 ε , where N ε (S) denotes the ε-covering number of S by closed balls in the rescaled metric. Following the notation of (Weed & Bach, 2019, §4 .1), we then obtain for any s > 2 d ε (µ, ε s/(s-2) ) = log inf N ε (S) µ(S) ≥ 1 -ε s/(s-2) -log ε ≤ 1. 
Invoking (Weed & Bach, 2019, Proposition 5) , we obtain after some simplifications of the constants that for any 0 < δ ≤ 1 (putting s = δ + 2 in the previous estimates) 2+δ) , where the final inequality worst-cases constants for convenience. Using (Weed & Bach, 2019 , Proposition 20), we have E W µ, µ N ≤ 3 11/δ N -1/(2+δ) + 3 6 N -1/2 ≤ e 14/δ N -1/( P W(µ, µ N ) + E W(µ, µ N ) ≥ d N ≤ e -2d , and hence P W(µ, µ N ) ≥ √ de 14/δ N 1/(2+δ) ≤ P W(µ, µ N ) ≥ e 14/δ N 1/(2+δ) + d N ≤P W(µ, µ N ) + E W(µ, µ N ) ≥ d N ≤ e -2d if d ≥ 1. Lemma B.17. Let n, m ∈ N. Let F : R n → R m be 1-nonnegatively homogeneous, and suppose there exist M, L ≥ 0 such that 1. F S n-1 2 L ∞ ≤ M ; 2. F S n-1 is L-Lipschitz. Then for any x, x ∈ R n , one has F (x) -F (x ) 2 ≤ (2L + M ) x -x 2 , so that F is (2L + M )-Lipschitz. Proof. For any numbers a, b ≥ 0 and any u, v ∈ R m , one has by the triangle inequality au -bv 2 ≤ min{a u -v 2 + |a -b| v 2 , b u -v 2 + |a -b| u 2 }. Using an elementary property of the min and the max, we thus have au -bv 2 ≤ min{a, b} u -v 2 + max{ u 2 , v 2 }|a -b|. (B. 99) Now we proceed to show the claim. Noting that the case where both x, x are zero is trivial, first consider the case where x is nonzero and x is zero. By nonnegative homogeneity, it suffices to proceed as F (x) -F (x ) 2 = F (x) 2 = x 2 F x x 2 2 ≤ M x 2 = M x -x 2 to conclude; for the inequality we used the boundedness assumption on F . Now fix x, x ∈ R n nonzero. The inequality (B.99) can be applied to get F (x) -F (x ) 2 = x 2 F x x 2 -x 2 F x x 2 2 ≤ min{ x 2 , x 2 } F x x 2 -F x x 2 2 + max F x x 2 2 , F x x 2 2 x -x 2 , where in the inequality we also applied the 2 triangle inequality. Using the assumed properties of F , we thus have F (x) -F (x ) 2 ≤ L min{ x 2 , x 2 } x x 2 - x x 2 2 + M x -x 2 . By a classical inequality (e.g. proved in (E.15)), one has x x 2 - x x 2 2 ≤ 2 x -x 2 max{ x 2 , x 2 } , whence F (x) -F (x ) 2 ≤ (2L + M ) x -x 2 , as was to be shown.
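The classical normalization inequality invoked at the end of the proof of Lemma B.17 (cf. (E.15)) can be spot-checked numerically; this is a quick illustration, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalization_gap(x, xp):
    """LHS and RHS of the classical inequality (cf. (E.15)):
    || x/||x|| - x'/||x'|| ||_2 <= 2 ||x - x'||_2 / max(||x||_2, ||x'||_2)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    lhs = np.linalg.norm(x / nx - xp / nxp)
    rhs = 2.0 * np.linalg.norm(x - xp) / max(nx, nxp)
    return lhs, rhs

# Random nonzero pairs in R^5:
for _ in range(10_000):
    x, xp = rng.normal(size=5), rng.normal(size=5)
    lhs, rhs = normalization_gap(x, xp)
    assert lhs <= rhs + 1e-12
```

Combined with the boundedness and Lipschitz hypotheses on the restriction of F to the sphere, this inequality is exactly what converts control on the sphere into the global (2L + M)-Lipschitz bound.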

C SKELETON ANALYSIS AND CERTIFICATE CONSTRUCTION

In this section, we construct a certificate g for the certificate problem (B.1) in the context of a simple model geometry. We also collect technical estimates relevant to the analysis of the skeleton ψ. We point to Appendix A.5.2 for a summary of the operator and function definitions relevant to the certificate problem that we will use below. We will use the notation Θ[g](x) = ∫_M ψ(∠(x, x')) g(x') dµ∞(x') in this section; we call explicit attention to this notation to avoid confusion with the kernel Θ = ψ₁ ∘ ∠ that we have defined in the main text for convenience of exposition.
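As a concrete (hypothetical) illustration of how the operator Θ acts, one can discretize the integral by quadrature over samples of M; the kernel ψ(ν) = cos ν below is a stand-in for illustration only, not the skeleton analyzed in this section:

```python
import numpy as np

def apply_Theta(psi, X, w, g):
    """Quadrature discretization of Theta[g](x_i) = ∫_M psi(angle(x_i, x')) g(x') dmu(x').
    X: (N, n0) array of unit vectors sampling M; w: quadrature weights; g: samples of g."""
    angles = np.arccos(np.clip(X @ X.T, -1.0, 1.0))   # pairwise spherical angles
    return psi(angles) @ (w * g)

# Toy check on the circle S^1 with the stand-in kernel psi(nu) = cos(nu):
N = 400
theta = 2 * np.pi * np.arange(N) / N
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
w = np.full(N, 2 * np.pi / N)        # uniform quadrature weights on the circle
out = apply_Theta(np.cos, X, w, np.ones(N))
# An angle-invariant kernel maps constant functions to constant functions:
assert np.allclose(out, out[0])
```

This rotation-invariance of the discretized operator is the discrete counterpart of the diagonalization in the Fourier basis exploited in Appendix C.1.1.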

C.1 CERTIFICATE CONSTRUCTION

To construct a certificate, it suffices to solve the integral equation

ζ = Θ[g] (C.1)

for a function g ∈ L²_µ∞(M) and to obtain estimates on the norm of g. It is useful to consider separately the contributions of integration over the class manifolds M± in the action of the operator Θ: we can write for any g

Θ[g](x) = ∫_{M+} ψ(∠(x, x')) g(x') dµ∞+(x') + ∫_{M−} ψ(∠(x, x')) g(x') dµ∞−(x'),

and it then makes sense to further subdivide based on whether the evaluation point x lies in M+ or M−, and to introduce the density ρ explicitly by a change of variables. With a slight abuse of notation, we will write dx' to denote the Riemannian measure on both M+ and M−, for concision. Because the kernel ψ ∘ ∠ is symmetric, if we define an operator Θ+ : L²(M+) → L²(M+) by

Θ+[g+](x) = ∫_{M+} ψ(∠(x, x')) g+(x') dx',

an operator Θ− : L²(M−) → L²(M−) by

Θ−[g−](x) = ∫_{M−} ψ(∠(x, x')) g−(x') dx',

and an operator Θ± : L²(M+) → L²(M−) by

Θ±[g+](x) = ∫_{M+} ψ(∠(x, x')) g+(x') dx',

then we can write the certificate equation (C.1) equivalently as the 2 × 2 block operator equation

[ζ+; ζ−] = [Θ+, Θ±*; Θ±, Θ−] [ρ+ g+; ρ− g−],

where we write ρ+ and ρ− for the restrictions of the density ρ to M+ and M−, respectively, and where the adjoint is taken for operators between the Hilbert spaces L²(M+) and L²(M−). We will make use of this notation in the sequel.
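The block structure can be illustrated numerically. The sketch below is not the paper's kernel: we substitute the strictly positive definite stand-in k(ν) = exp(cos ν) and sample two hypothetical "manifolds" at random, then assemble the discretized 2 × 2 block system and solve it for a piecewise-constant target:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_points(N):
    """N random unit vectors in R^3 (stand-ins for samples of M+ or M-)."""
    Z = rng.normal(size=(N, 3))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def kernel_block(k, X, Y):
    """Matrix of k(angle(x_i, y_j)) for samples X, Y of the two components."""
    return k(np.arccos(np.clip(X @ Y.T, -1.0, 1.0)))

k = lambda nu: np.exp(np.cos(nu))    # exp(<x, y>): strictly PD stand-in kernel
Xp, Xm = sphere_points(8), sphere_points(8)
Kpp = kernel_block(k, Xp, Xp)
Kmm = kernel_block(k, Xm, Xm)
Kpm = kernel_block(k, Xp, Xm)
# Discretized 2x2 block operator [[Theta+, Theta±*], [Theta±, Theta-]]:
K = np.block([[Kpp, Kpm], [Kpm.T, Kmm]])
assert np.allclose(K, K.T)           # symmetry of the kernel gives a symmetric system

# Solve the discretized certificate system K h = zeta for piecewise-constant zeta:
zeta = np.concatenate([np.ones(8), -np.ones(8)])
h = np.linalg.solve(K, zeta)
assert np.allclose(K @ h, zeta)
```

The off-diagonal blocks being transposes of one another is the discrete counterpart of the adjoint relationship between Θ± and Θ±*.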

C.1.1 TWO CIRCLES

The two circles geometry is a highly-symmetric geometry where M + and M -are coaxial circles in the upper and lower hemispheres of S 2 , each of radius 0 < r < 1. Here we note that since the skeleton ψ depends only on the angle between points of S n0-1 , the particular embedding of this geometry into S n0-1 is irrelevant, and it is without loss of generality to consider the geometry in S 2 once we have restricted ourselves to this configuration. We have unit-speed charts, for t ∈ [0, 2πr] γ + (t) =   r cos t/r r sin t/r √ 1 -r 2   , γ -(t) =   r cos t/r r sin t/r - √ 1 -r 2   , which implies specific forms of the spherical distances ∠ (γ + (t), γ + (t )) = cos -1 r 2 cos t -t r + (1 -r 2 ) (C.2) and ∠ (γ + (t), γ -(t )) = cos -1 r 2 cos t -t r -(1 -r 2 ) , (C.3) with the analogous results for the remaining possible combinations of domains, by symmetry. Because ζ is piecewise constant on each connected component of M, there are constants C + , C -such that C + = ζ on M + and C -= ζ on M -. The block-structured system we are interested in solving is then C + C - = Θ+ Θ * ± Θ± Θ- ρ + g + ρ -g - , (C.4) where subscripts are used to denote the domain of each component of the certificate. The coordinate representations (C.2) and (C.3) show that each of the operators appearing in the 2 × 2 matrix in (C.4) is invariant on the circle; we can obtain some useful simplifications by identifying these operators with their coordinate representations. Defining f r (t) = cos -1 r 2 cos t + (1 -r 2 ) , g r (t) = cos -1 r 2 cos t -(1 -r 2 ) , and (self-adjoint) operators on 2π-periodic functions g by A[g](t) = 2π 0 ψ • f r (t -t )g(t ) dt , X [g](t) = 2π 0 ψ • g r (t -t )g(t ) dt , by a change of coordinates, it is equivalent to solve the system r -1 C + r -1 C - = A X X A ρ + g + ρ -g - , (C.5) where we have identified ρ + and ρ -with their coordinate representations, and with an abuse of notation used the same notation for the certificate as in (C.4). 
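The coordinate formulas (C.2) and (C.3) follow from direct computation with the charts γ±; a numerical spot check (the radius r = 0.8 is an arbitrary illustrative choice):

```python
import numpy as np

r = 0.8
h = np.sqrt(1.0 - r**2)
gamma_p = lambda t: np.array([r*np.cos(t/r), r*np.sin(t/r),  h])   # chart of M+
gamma_m = lambda t: np.array([r*np.cos(t/r), r*np.sin(t/r), -h])   # chart of M-

rng = np.random.default_rng(1)
for _ in range(1000):
    t, tp = rng.uniform(0.0, 2*np.pi*r, size=2)
    # (C.2): spherical angle within M+
    lhs = np.arccos(np.clip(gamma_p(t) @ gamma_p(tp), -1, 1))
    rhs = np.arccos(np.clip(r**2*np.cos((t - tp)/r) + (1 - r**2), -1, 1))
    assert np.isclose(lhs, rhs, atol=1e-7)
    # (C.3): spherical angle between M+ and M-
    lhs = np.arccos(np.clip(gamma_p(t) @ gamma_m(tp), -1, 1))
    rhs = np.arccos(np.clip(r**2*np.cos((t - tp)/r) - (1 - r**2), -1, 1))
    assert np.isclose(lhs, rhs, atol=1e-7)
```

Both formulas are immediate from ⟨γ±(t), γ±(t')⟩ = r² cos((t − t')/r) ± (1 − r²) and ∠(x, x') = cos⁻¹⟨x, x'⟩.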
We can use symmetry properties to determine g r (t) = π -f r (t -π), so for purposes of analysis we need only consider f r . Each of the invariant operators in (C.5) diagonalizes in the Fourier basis, and because the target ζ is a piecewise constant function, we only need to use the first Fourier coefficient. In other words, we can solve this system by first inverting the invariant operator, which responds to only the constant component of the target, and then inverting the density multiplication operators. This approach is made precise in the following lemma. Lemma C.1. There is an absolute constant K > 0 such that if L ≥ max{K, (π/2)(1 -r 2 ) -1/2 } and r ≥ 1 2 , then the system (C.4) has a solution that satisfies g + g -L 2 µ ∞ ≤ 64 ζ L ∞ (M) nπ 1/2 ρ 1/2 min . Proof. Following the discussion by (C.5), it is equivalent to solve the system in the Fourier basis, with only the DC component. We thus start by solving the system r -1 C + r -1 C - = 2 π 0 ψ • f r (t) dt 2 π 0 ψ • g r (t) dt 2 π 0 ψ • g r (t) dt 2 π 0 ψ • f r (t) dt G + G - , where G + and G -are constants that we will show exist. This is a 2 × 2 system, and the matrix is symmetric, with minimum eigenvalue 2 π 0 (ψ • f r -ψ • g r )(t) dt. Using Lemma C.2, we have if L ≥ max{K, (π/2)(1 -r 2 ) -1/2 } and r ≥ 1 2 2 π 0 (ψ • f r -ψ • g r )(t) dt ≥ πn 32r , so the 2 × 2 matrix is invertible, and by an operator norm bound on its inverse we have the regularity estimate G 2 + + G 2 - 1/2 ≤ 32 πn C 2 + + C 2 - 1/2 . It follows that the function g + g - = G+ ρ+ G- ρ- solves the system (C.4). We conclude g + g - 2 L 2 µ ∞ = 2π 0 G + ρ + • γ + (t) 2 ρ + • γ + (t) dt + 2π 0 G - ρ -• γ -(t) 2 ρ -• γ -(t) dt ≤ 2 11 πn 2 ρ min C 2 + + C 2 -. Taking square roots on both sides of the expression resulting from the last inequality will give the claim, after we simplify the expression C 2 + + C 2 -. Since C 2 + + C 2 -≤ √ 2 max{C + , C -} = √ 2 ζ L ∞ (M) , we can conclude after adjusting constants. Lemma C.2. 
There exists an absolute constant K > 0 such that if L ≥ max{K, (π/2)(1-r 2 ) -1/2 } and if r ≥ 1 2 , one has 2 [0,π] (ψ • f r -ψ • g r )(t) dt ≥ πn 32r . Proof. Write σ r = ψ • f r -ψ • g r for brevity , which is nonnegative, by Lemma C.3. We consider the tangent line to the graph of σ r at 0; by Lemma C.3, this line has the form t → σ r (0) -tnrL(L + 1)/4π, and its graph hits the horizontal axis at t = 4πσ r (0)/nrL(L + 1). Using that σ r (0) ≤ ψ(0) = nL/2, we see that this point of intersection is no larger than 2π/r(L + 1), which can be made less than K by choosing L ≥ K , where K > 0 is the absolute constant appearing in the convexity bound of Lemma C.3, and K > 0 is an absolute constant. Under this condition, we obtain using Lemma C.3 σ r (t) ≥ σ r (0) -tnrL(L + 1)/4π, and so [0,π] σ r (t) dt ≥ [0,4πσr(0)/nrL(L+1)] (σ r (0) -ntrL(L + 1)/4π) dt = 2πσ r (0) 2 nrL(L + 1) . We have σ r (0) = nL/2 -ψ(cos -1 (2r 2 -1)), and using the estimate of Lemma C.20, we get ψ(ν) ≤ nL 2 1 + Lν/2π 1 + Lν/π . Together with the estimate cos -1 (2r 2 -1) ≥ 2 √ 1 -r 2 , we obtain σ r (0) = nL 2 -ψ(cos -1 (2r 2 -1)) ≥ nL 2 L √ 1 -r 2 π + 2L √ 1 -r 2 ≥ nL 8 , where the final inequality requires the choice L ≥ π/2 √ 1 -r 2 . Thus, we have shown [0,π] σ r (t) dt ≥ πn 64r , as claimed.
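The lower bound of Lemma C.2 can be checked numerically using the approximation ψ defined in Appendix C.2.2 (the parameter choices n = 2, L = 50, r = 0.8 are illustrative and comfortably satisfy the lemma's hypotheses):

```python
import numpy as np

def phi(nu):
    """Heuristic angle evolution function (Appendix C.2.2)."""
    return np.arccos(np.clip((1 - nu/np.pi)*np.cos(nu) + np.sin(nu)/np.pi, -1.0, 1.0))

def psi_approx(nu, n, L):
    """Approximate skeleton psi(nu) = (n/2) sum_{l=0}^{L-1} prod_{l'=l}^{L-1} (1 - phi^(l')(nu)/pi)."""
    its = [np.asarray(nu, dtype=float)]
    for _ in range(L - 1):
        its.append(phi(its[-1]))            # iterates phi^(0), ..., phi^(L-1)
    factors = 1.0 - np.stack(its) / np.pi   # shape (L, ...)
    xi = np.cumprod(factors[::-1], axis=0)[::-1]   # xi[l] = prod_{l'=l}^{L-1}
    return 0.5 * n * xi.sum(axis=0)

n, L, r = 2, 50, 0.8
f_r = lambda t: np.arccos(np.clip(r**2*np.cos(t) + (1 - r**2), -1.0, 1.0))
g_r = lambda t: np.arccos(np.clip(r**2*np.cos(t) - (1 - r**2), -1.0, 1.0))
t = np.linspace(0.0, np.pi, 20_001)
sigma = psi_approx(f_r(t), n, L) - psi_approx(g_r(t), n, L)
assert sigma.min() >= -1e-9                  # nonnegativity, cf. Lemma C.3(i)
integral = 2.0 * np.mean(sigma) * np.pi      # 2 ∫_0^pi sigma(t) dt by quadrature
assert integral >= np.pi * n / (32.0 * r)    # Lemma C.2's lower bound
```

In this regime the integral exceeds the bound by a comfortable margin: the mass of σ_r concentrates near t = 0 at scale roughly 1/L, where σ_r(0) is of order nL, exactly as the tangent-line argument in the proof predicts.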

Lemma C.3.

There is an absolute constant 0 < K ≤ π/2 such that if L ≥ 3, one has for all r ∈ (0, 1): (i) ψ • f r -ψ • g r ≥ 0 on [0, π]; (ii) (ψ • f r -ψ • g r ) (0) = -nrL(L + 1)/4π; (iii) ψ • f r -ψ • g r is convex on [0, K]. Proof. In this proof, we will make use of basic results on the skeleton ψ, namely Lemmas E.5, C.17 and C.18 without making explicit reference to them. Property (i) follows from the fact that ψ is decreasing, cos -1 is decreasing, and the definitions of f r and g r . We note that f r is smooth on (0, π); to prove property (ii), it will suffice to show that f r admits a right derivative at 0 and π and apply the chain rule. We have if t ∈ (0, π) f r (t) = r 2 sin t 1 -(r 2 cos t + (1 -r 2 )) 2 = 1 2 + r 2 (cos t -1) r sin t √ 1 -cos t after some rearranging, and by periodicity and symmetry properties of f r , we have lim t 0 g r (t) = lim t π f r (t) = 0. We Taylor expand sin t(1-cos t) -1/2 in a neighborhood of zero to evaluate the derivatives there. We have sin t = t-t 3 /6+O(t 5 ) and 1-cos t = t 2 /2(1-t 2 /2+O(t 4 )); by the binomial series, we have (1 -cos t) -1/2 = √ 2/t(1 + t 2 /4 + O(t 4 )), whence sin t(1 -cos t) -1/2 = √ 2 + √ 2t 2 /12 + O(t 4 ), lim t 0 f r (t) = r. Thus (ψ • f r -ψ • g r ) (0) = ψ (0)f r (0) = - nrL(L + 1) 4π . For property (iii), now consider t ∈ [0, π/2] when necessary. The chain rule gives (ψ • f r -ψ • g r ) = [(ψ • f r )f r -(ψ • g r )g r ] + [(ψ • f r )(f r ) 2 -(ψ • g r )(g r ) 2 ], and we have if t ∈ (0, π) f r (t) = r 4 (1 -cos t) cos t -(r 2 cos t + (1 -r 2 )) 1 -(r 2 cos t + (1 -r 2 )) 2 3/2 after some rearranging of the numerator. We have 1 -cos t ≥ 0, and so the estimate r 2 cos t + (1r 2 ) ≥ cos t (with equality only at t = 0) yields f r (t) ≤ 0 (with a strict inequality if 0 < t < π). By symmetry, this implies that g r ≥ 0, and using that ψ ≤ 0, we obtain (ψ • f r -ψ • g r ) ≥ (ψ • f r )(f r ) 2 -(ψ • g r )(g r ) 2 . 
By symmetry, we have g_r'(t) = f_r'(π − t) on [0, π], and because f_r is strictly concave we know as well that f_r' is strictly decreasing; it follows that f_r' − g_r' is also strictly decreasing, and its unique zero satisfies the equation

(1 − cos t)/(1 + cos t) = (2 − r²(1 + cos t)) / (2 − r²(1 − cos t)).

Noting that t = π/2 satisfies this equation, we conclude that f_r' ≥ g_r' on [0, π/2], so that on this interval we have

(ψ ∘ f_r − ψ ∘ g_r)'' ≥ (g_r')² ((ψ'' ∘ f_r) − (ψ'' ∘ g_r)).

By Lemma C.19, if L ≥ 3 there is an absolute constant K > 0 such that ψ''' ≤ 0 on [0, K]. The previous bound then yields (ψ ∘ f_r − ψ ∘ g_r)'' ≥ 0, as claimed.

C.2 AUXILIARY RESULTS

C.2.1 GEOMETRIC RESULTS

Lemma C.4. Let M be a complete Riemannian submanifold of the unit sphere S n0-1 (with respect to the spherical metric induced by the euclidean metric on R n0 ) with finitely many connected components K. If d 0 = 1, assume moreover that each connected component of M is a smooth regular curve. Then for every 0 < ε ≤ 1, there is a ε-net for M in the euclidean metric • 2 having cardinality no larger than (C M /ε) d0 , where C M ≥ 1 is a constant depending only on the diameters sup x,x ∈Mi dist Mi (x, x ) and, when d 0 ≥ 2, additionally on the extremal Ricci curvatures of M i . Moreover, these nets have the property that if x ∈ M is given, there is a point in the net x within euclidean distance ε of x such that x lies in the same connected component of M as x. Proof. Consider a fixed connected component M i with i ∈ [K]. We write the Riemannian distance of M i as dist Mi ; because M i is a Riemannian submanifold of R n0 , we have dist Mi (x, y) ≥ x -y 2 for every x, y in M i . Because dist Mi (x, y) ≥ x -y 2 , it suffices to estimate the covering number in terms of the Riemannian distance. We will consider distinctly the cases d 0 = 1 and d 0 ≥ 2, starting with d 0 = 1. When d 0 = 1, we have assumed that M i are regular curves, so it is without loss of generality to assume they are moreover unit-speed curves parameterized by arc length, with lengths len(M i ). It follows that we can obtain an ε-net for M i in terms of dist Mi having cardinality at most len(M i )/ε when 0 < ε ≤ 1, and by the submanifold property these sets also constitute ε-nets for M i in terms of the 2 distance. Covering each connected component M i in this way gives a ε-net for M by taking the union of each connected component's net. When d 0 ≥ 2, we make use of standard results relating the covering number to the curvature and diameter of M. 
Let diam(M i ) = sup x,x ∈Mi dist Mi (x, x ), and let Ric i denote the Ricci curvature tensor of M i (recall that we assume the metric on M is the one induced by the euclidean metric). Then because M is compact, (1) max i∈[K] diam(M i ) < +∞; and (2) because Ric i is moreover continuous, there are constants k i > 0 such that Ric i ≥ -(d 0 -1)k i for each i ∈ [K]. Applying Lemma C.5, it follows that for any ε > 0, there is a ε-net for M i in terms of dist Mi with cardinality no larger than (C Mi /ε) d0 , where C Mi diam(M i )e 2 diam(Mi) √ ki . Thus, for any i ∈ [K], any d 0 ≥ 1 and any 0 < ε ≤ 1, we can conclude that there is a ε-net for M i in the euclidean metric having cardinality no larger than (C Mi /ε) d0 , where C Mi = len(M i ) d 0 = 1 16 diam(M i )e 2 diam(Mi) √ ki d 0 ≥ 2. Taking the union of these nets and applying Lemma G.10 for simplicity, we conclude that for any d 0 ≥ 1 and any 0 < ε ≤ 1, there is a ε-net for M in the euclidean metric having cardinality no larger than (C M /ε) d0 , where C M = 1 + K i=1 len(M i ) d 0 = 1 1 + 16 K i=1 diam(M i )e 2 diam(Mi) √ ki d 0 ≥ 2. The additional property claimed is satisfied by our construction of the nets. Proof. The proof is essentially an application of (Zhu, 1997, Lemma 3.6 ) together with some calculations on volumes of geodesic balls in hyperbolic space that we record here for completeness, although they are classical. For any r > 0 and any p ∈ M, write  and r, ε > 0. The hypotheses of the lemma make (Zhu, 1997, Lemma 3.6 ) applicable, whence B r (p) = {x ∈ M | dist M (p, x) ≤ r}. Fix p ∈ M inf    card(S) S ⊂ B r (p), B r (p) ⊂ p ∈S B ε (p )    ≤ vol(B k (2r)) vol(B k (ε/4)) , where card(S) denotes the cardinality of a set S, and for all ε > 0, vol(B k (ε)) denotes the volume of a geodesic ball of radius r in the d-dimensional simply-connected hyperbolic space of constant sectional curvature -k; these spaces are homogeneous and isotropic so the base point does not matter (c.f. 
(Lee, 2018, Proposition 3.9)). In particular, we can calculate these volumes in any model of hyperbolic space and anchored at any base point; we choose the Poincaré ball model and the base point 0, where the maximal unit-speed geodesics take the simple form , 2018, Theorem 3.7, Proposition 5.28) . Integrating the volume form in coordinates, we then get for any ε > 0 γ(t) = k -1/2 v tanh √ kt 2 for v ∈ S d and t ∈ R (Lee vol(B k (ε)) = (k -1/2 tanh √ kε/2))B d 2/k 1/k -x 2 2 d dx = k -d/2 (tanh √ kε/2)B d 2 1 -x 2 2 d dx where the second line changes coordinates x → k -1/2 x. Changing to polar coordinates in the last expression, we get vol(B k (ε)) = k -d/2 vol(S d-1 ) [0,tanh √ kε/2] x d-1 2 1 -x 2 d dx, and then changing coordinates x → tanh x, we obtain after applying several trigonometric identities vol(B k (ε)) = k -d/2 vol(S d-1 ) [0, √ kε/2] 2 sinh d-1 (2x) dx = k -d/2 vol(S d-1 ) [0, √ kε] sinh d-1 (x) dx, whence vol(B k (2r)) vol(B k (ε/4)) = [0,2r √ k] sinh d-1 (x) dx [0,ε √ k/4] sinh d-1 (x) dx . We have bounds x ≤ sinh x ≤ xe x for nonnegative x,foot_9 which gives after integration vol(B k (2r)) vol(B k (ε/4)) ≤ [0,2r √ k] x d-1 e (d-1)x dx [0,ε √ k/4] x d-1 dx ≤ d (2r √ k) d [0,1] x d-1 e 2r √ k(d-1)x dx (ε √ k/4) d ≤ 16re 2r √ k ε d , where in the second line we change coordinates x → (2r √ k)x, and then use L ∞ control of the (monotone increasing) integrand in the second line to move to the expression in the third line. Remark C.6. The constant C M in Lemma C.5 can be sharpened if more is known about the curvature of M: if Ric ≥ 0, the exponential dependence on curvature and diameter can be removed (intuitively, taking k 0 "recovers" this from the proved result), and if Ric > 0, the dependence on diameter can be completely removed using Myers' theorem (Zhu, 1997, Theorem 3.4 (1)). Lemma C.7. For any x, x , x, x in S n0-1 , one has |∠(x, x ) -∠( x, x )| ≤ √ 2| x -x 2 -x -x 2 |. Proof. 
Writing ∠(x, x ) = cos -1 x, x = cos -1 (1-(1/2) x-x 2 2 ), consider the function f (x) = cos -1 (1 -(1/2)x 2 ) for x ∈ [- √ 2, √ 2], which is differentiable except possibly at 0. We calculate f (x) = x 1 -(1 -1 2 x 2 ) 2 = sign x 1 -1 4 x 2 , and taking limits at 0 shows that f admits left and right derivatives on all of [- √ 2, √ 2]. f is even- symmetric, so by checking values at 0 and √ 2 we conclude that |f | ≤ √ 2, which shows that f is √ 2-Lipschitz. The claim follows. Lemma C.8. Let d 0 = 1. Choose L so that L ≥ Kκ 2 C λ , where κ and C λ are respectively the curvature and global regularity constants defined in Section 2.1, and K, K > 0 are absolute constants. Then sup x∈M± M dµ ∞ (x ) (1 + (L/π)∠(x, x )) 2 ≤ Cρ max (len(M + ) + len(M -)) L , where C is an absolute constant and M ± denotes either M + or M -. Proof. Recall that γ + , γ -denote unit-speed curves parameterized with respect to arc length whose images are M + , M -. For convenience, define g(ν) = 1/(1 + Lν/π). We have sup x∈M± M (g(∠(x, x ))) 2 dµ ∞ (x ) ≤ sup x∈M± M+ (g(∠(x, x ))) 2 dµ ∞ + (x ) + sup x∈M± M- (g(∠(x, x ))) 2 dµ ∞ -(x ). (C.6) First, we note that |g| is strictly decreasing. We claim that for any x ∈ M -, there is a x ∈ M + such that ∠(x , x ) ≤ ∠(x, x ) for all x ∈ M + ; it is easy to see this is the case by choosing x to achieve the minimum in min x ∈M+ ∠(x, x ) and arguing by contradiction. By monotonicity of the integral, this implies sup x∈M± M+ (g(∠(x, x ))) 2 dµ ∞ + (x ) ≤ sup x∈M+ M+ (g(∠(x, x ))) 2 dµ ∞ + (x ), (C.7) and similarly for the term involving integration over M -. Therefore sup x∈M± M (g(∠(x, x ))) 2 dµ ∞ (x ) ≤ sup x∈M+ M+ (g(∠(x, x ))) 2 dµ ∞ + (x ) + sup x∈M-M- (g(∠(x, x ))) 2 dµ ∞ -(x ), (C.8) and it suffices to analyze these two terms. We bound the first term, since the second can be bounded by an identical argument. By compactness, the supremum in this term is attained at some x ∈ M + . 
Taking t such that γ + (t) = x, we can write sup x∈M+ M+ g(∠(x, x )) 2 dµ ∞ + (x ) ≤ ρ max S+ 0 g(∠(γ(t), γ(s))) 2 ds. (C.9) We split the interval [0, S + ] into two disjoint sub-intervals [0, S + ] ∩ [t -K τ / √ L, t + K τ / √ L] and [0, S + ] \ [t -K τ / √ L, t + K τ / √ L] , corresponding to "large scale" and "small scale" behavior, where K λ is the global regularity constant defined in (A.2). If we now assume 1 √ L ≤ c λ κ , then from (A.2) we obtain ∠(x, x ) ≤ 1 √ L ⇒ dist M (x, x ) ≤ K λ √ L and hence dist M (x, x ) > K λ √ L ⇒ ∠(x, x ) > 1 √ L . From the definition of g it follows that g( 1 √ L ) = 1 1 + √ L/π ≤ π √ L . Since |g| is a monotonically decreasing function we can bound the second integral in (C.9), obtaining s∈[0,S+]\[t * - K λ √ L ,t * + K λ √ L ] (g(∠(γ(s), γ(t * ))) 2 ds ≤ len(M + )C /L. (C.10) We next consider the remaining interval of integration in (C.9). Defining S + + = min K λ √ L , S + -t * , S - + = min K λ √ L , t * , and ν ± (s) = ∠(γ + (t * ± s), γ + (t * )) , the integral of interest can be written as ρ max s∈[0,S+]∩[t * -Kτ √ L ,t * + Kτ √ L ] (g(∠(γ(s), γ(t * )))) 2 ds = ρ max S + + s=0 (g(ν + (s))) 2 ds +ρ max S - + s=0 (g(ν -(s))) 2 ds. (C.11) It will be sufficient to consider the first integral here since the second one can be bounded in an identical fashion. We aim to show that the integral above is not too large. This will be the case if ν + (s) stays very small for a large range of values of s. To show that this is does not occur, we will use our bounds on the curvature of M to bound ν + (s) uniformly from below, which will in turn provide an upper bound on the integral. We will require an application of Lemma C.9, which will be applicable if S + + ≤ π κ . If L ≥ κ 2 K 2 τ π 2 we have S + + ≤ K λ √ L ≤ π κ . It follows immediately that Lemma C.9 applies to any restriction of γ + of length no larger than π κ . Next define by γ : [0, S + + ] → S n0-1 an unit-speed arc of curvature κ, and ν(s ) = ∠(γ(0), γ(s)). 
We claim that ∀s ∈ [0, S + + ] : ν + (s) ≥ ν(s). (C. 12) The proof is by contradiction. Assume there is some r such that ν + (r) < ν(r). (C.13) Now define by γ r : [0, r] → S n0-1 a restriction of γ + such that γ r (0) = γ + (t * ), γ r (s) = γ + (t * + s) , by γr an arc with curvature κ and the same endpoints as γ r , and by γr a restriction of γ with len(γ r ) = len(γ r ) = r. Note that ∠(γ r (0), γr (s)) = ν(r). However, an application of Lemma C.9 gives len(γ r ) ≤ len(γ r ) < len(γ r ) where the second inequality is because γr and γr have identical curvature at every point, and by assumption (C.13) the endpoints of γr are a greater geodesic (and hence euclidean) distance from each other than the endpoints of γr (which are a distance ν + (r) apart). This inequality contradicts the equality above it, and we conclude that no such r exists, and (C.12) holds. We have that |g| is a monotonically decreasing function, hence we can write for the first integral in (C.11) S + + s=0 (g(ν + (s))) 2 ds ≤ S + + s=0 (g(ν(s))) 2 ds. We now bound this integral. Since γ is an arc with curvature κ, from the proof of Lemma C.3 we have that ν is concave, and since ν(0) = 0 we can write ν(s) ≥ ν(S + + ) S + + s, and since |g| is monotonically decreasing S + + s=0 (g(ν(s))) 2 ds ≤ S + + s=0 g( ν(S + + ) S + + s) 2 ds = S + + ν(S + + ) ν(S + + ) s=0 (g(s)) 2 ds = S + + ν(S + + ) ν(S + + ) 1 + Lν(S + + )/π ≤π S + + Lν(S + + ) where we used the definition of g. It remains to show that S + + and ν(S + + ) are close. Since γ is an arc with curvature κ and length S + + , if we additionally assume L ≥ Kκ 2 C λ for some K chosen so that κ γ(0)-γ(S + + ) 2 2 ≤ κS + + 2 ≤ κC λ 2 √ L ≤ 1 2 , we obtain γ(0) -γ(S + + ) 2 ≤S + + = 2 κ sin -1 κ γ(0) -γ(S + + ) 2 2 ≤ γ(0) -γ(S + + ) 2 + κ 2 4 γ(0) -γ(S + + ) 3 2 γ(0) -γ(S + + ) 2 -S + + ≤ κ 2 4 γ(0) -γ(S + + ) 3 2 ≤ κ 2 4 S + + 3 ≤ κ 2 4 K 2 L S + + where in the first line we used sin -1 (x) ≤ x + x 3 for x. 
Since γ(0) -γ(S + + ) 2 ≤ ∠(γ(0), γ(S + + )) = ν(S + + ) ≤ S + + we obtain ν(S + + ) -S + + ≤ κ 2 4 K 2 λ L S + + and hence S + + ν(S + + ) ≤ S + + S + + -κ 2 4 K 2 λ L S + + = 1 1 -κ 2 4 K 2 λ L . We now choose L ≥ Kκ 2 K 2 τ for some K, so that the above term is smaller than 2. We therefore have Si s=0 (g(ν(s))) 2 ds ≤ C/L for some C. We can bound the second integral in (C.11) in an identical fashion. Combining this result with (C.10) and recalling (C.9), we obtain sup x∈M+ M+ (g(∠(x, x ))) 2 dµ ∞ (x ) ≤ C ρ max (len(M + ) + len(M -))/L for some constant, which completes the proof. Lemma C.9. Given a smooth, simple open curve in R n of length S with unit-speed parametrization γ : [0, S] → R n such that for some κ > 0 1. γ 2 ≤ κ 2. S ≤ π κ define by γ an arc of any circle of radius 1 κ such that γ(0) = γ(0), γ( S) = γ(S), S ≤ π κfoot_10 . We then have

S̃ ≤ S.

Proof. This result is a generalization of a well known comparison theorem of Schur's to higher dimensions following the proof in (Sullivan, 2008) , where we additionally specialize to the case where one of the curves is an arc. Given a curve γ satisfying the conditions of the lemma, we first consider an arc γ of a circle of radius 1 κ and length S, with a unit-speed parametrization. At the midpoint of this arc, the tangent vector γ ( S 2 ) is parallel to γ(S) -γ(0), hence γ(S) -γ(0) 2 = γ ( S 2 ), γ(S) -γ(0) = γ ( S 2 ), S 0 γ (t)dt . Similarly, for the curve γ we have γ(S) -γ(0) 2 ≥ γ ( S 2 ), γ(S) -γ(0) = γ ( S 2 ), S 0 γ (t)dt . Denoting the angle between tangent vectors γ (a), γ (b) = cos θ(a, b), we use the fact that for any smooth curve with unit-speed parametrization γ (t ) 2 = dθ ds (t, s) s=t . = |θ (t)|. This gives for any t ∈ [0, S/2] γ ( S 2 ), γ ( S 2 + t) = cos    S 2 +t S 2 θ (t )dt    ≥ cos    S 2 +t S 2 |θ (t )| dt    = cos    S 2 +t S 2 γ (t) 2 dt    ≥ cos   κ S 2 +t S 2 dt    = cos    S 2 +t S 2 γ (t) 2 dt    = γ ( S 2 ), γ ( S 2 + t) where we have used monotonicity of cos over the relevant range which is ensured by assumption 2, and a similar argument follows for t ∈ (0, -S/2]. Combining these inequalities gives γ(S) -γ(0) 2 ≥ γ(S) -γ(0) 2 . We have shown that, unsurprisingly, if the curvature of γ is bounded and it is not too long, then the distance between its endpoints is greater than that of a curve of equal length but larger curvaturenamely the arc γ. We now consider the arc γ defined in the lemma statement. If S > S, due to assumption 2 this would imply γ(S) -γ(0) 2 > γ( S) -γ(0) 2 = γ(S) -γ(0) 2 contradicting the inequality proved above. It follows that S ≤ S.
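The proof repeatedly relates chord length to arc length for circular arcs of curvature κ. Under the assumption of a unit-speed parameterization of a circle of radius 1/κ, this relation is chord = (2/κ) sin(κS/2), which is strictly increasing for S ∈ [0, π/κ]; a quick numerical check:

```python
import numpy as np

def chord_of_arc(kappa, S):
    """Euclidean distance between the endpoints of a unit-speed circular arc
    of curvature kappa and length S <= pi/kappa: (2/kappa) * sin(kappa*S/2)."""
    return (2.0 / kappa) * np.sin(kappa * S / 2.0)

kappa = 1.5
# Verify against an explicit unit-speed parameterization of a circle of radius 1/kappa:
gamma = lambda s: np.array([np.sin(kappa*s), 1.0 - np.cos(kappa*s)]) / kappa
for S in np.linspace(0.01, np.pi / kappa, 50):
    assert np.isclose(np.linalg.norm(gamma(S) - gamma(0.0)), chord_of_arc(kappa, S))

# On [0, pi/kappa] the chord is strictly increasing in arc length, so matching
# the endpoints pins down the length of the (minor) arc uniquely:
S = np.linspace(0.0, np.pi / kappa, 1000)
assert np.all(np.diff(chord_of_arc(kappa, S)) > 0)
```

This monotonicity on [0, π/κ] is what assumption 2 of Lemma C.9 buys: it lets endpoint distances be compared directly with arc lengths in the comparison argument above.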

C.2.2 ANALYSIS OF THE SKELETON

Notation. Define ϕ (0) = Id, and for ∈ N define ϕ ( ) as the -fold composition of ϕ with itself, where ϕ(ν) = cos -1 1 -π -1 ν cos ν + π -1 sin ν is the heuristic angle evolution function. We will make use of basic properties of this function such as smoothness (established in Lemma E.5) below. In this section, we will study the skeleton ψ 1 (ν) = n 2 L-1 =0 cos ϕ ( ) (ν) L-1 = 1 -π -1 ϕ ( ) (ν) , ν ∈ [0, π], where we have not included the additive factor cos ϕ (L) (ν), as it is easily removed along the lines of Theorem B.2. We define ξ ( ) (ν) = L-1 = 1 -π -1 ϕ ( ) (ν) , = 0, . . . , L -1, so that ψ 1 (ν) = n 2 L-1 =0 cos ϕ ( ) (ν)ξ ( ) (ν). (C.14) We will also establish a convenient approximation to the skeleton. Define ψ(ν) = n 2 L-1 =0 ξ ( ) (ν). Lemma C.10 implies that ψ is convex; it is less trivial to obtain the same for ψ 1 . We will prove several estimates below for the terms ξ ( ) and their derivatives that can be used to immediately obtain useful estimates for ψ and its derivatives. Lemma C.10. For each = 0, 1, . . . , L, the functions ϕ ( ) are nonnegative, strictly increasing, and concave (positive and strictly concave on (0, π)); if 0 ≤ < L, the functions ξ ( ) , are nonnegative, strictly decreasing, and convex (positive and strictly convex on (0, π)). Proof. These claims are a consequence of some general facts for smooth functions that we articulate here so that we can rely on them often in the sequel. First, we have for any smooth function f : (0, π) → R (f • f ) = (f • f )f , (f • f ) = (f • f )f + (f ) 2 (f • f ). These equations show that if f > 0, f > 0, and f < 0, then f • f also satisfies these three properties. Lemma E.5 shows that ϕ satisfies these three properties on (0, π); we conclude from the mean value theorem and a simple induction the same for ϕ ( ) , as claimed. Meanwhile, if f, g are smooth real-valued functions on (0, π), we have (f g) = f g + g f, (f g) = f g + g f + 2f g . 
Thus, if f and g are both positive, strictly decreasing, strictly convex functions on (0, π), then fg also satisfies these three properties. Lemma E.5 implies that 0 < 1 − π⁻¹ϕ^(ℓ) < 1 on (0, π), and its first and second derivatives are scaled and negated versions of those of ϕ^(ℓ); we conclude by another induction that the same three properties apply to the functions ξ^(ℓ).

Lemma C.11. There is an absolute constant C > 0 such that if L ≥ 12 and n ≥ L, then one has ‖ψ₁ − ψ̄‖_{L∞} ≤ Cn/L.

Proof. We have from the triangle inequality

‖ψ₁ − ψ̄‖_{L∞} ≤ sup_{ν∈[0,π]} (n/2) |Σ_{ℓ=0}^{L−1} (cos ϕ^(ℓ)(ν) − 1) ξ^(ℓ)(ν)| ≤ (n/2) Σ_{ℓ=0}^{L−1} sup_{ν∈[0,π]} |cos ϕ^(ℓ)(ν) − 1| ξ^(ℓ)(ν),

where we use Lemma C.10 to take ξ^(ℓ) outside the absolute value. Notice that (cos ϕ^(ℓ) − 1) ξ^(ℓ) ≤ 0, so to control the L∞ norm of this term it suffices to bound it from below. We will show the monotonicity property

(cos ϕ^(ℓ) − 1) ξ^(ℓ) − (cos ϕ^(ℓ+1) − 1) ξ^(ℓ+1) ≥ 0, (C.15)

from which it follows that ‖ψ₁ − ψ̄‖_{L∞} ≤ (nL/2) sup_{ν∈[0,π]} |cos ϕ^(L−1)(ν) − 1|, using also ξ^(L−1)(ν) ≤ 1. Since cos x ≥ 1 − (1/2)x², and since Lemma C.12 gives that ϕ^(L−1) ≤ C/(L − 1) (and also estimates the constant), we have as soon as L ≥ 1 + C/√2

‖ψ₁ − ψ̄‖_{L∞} ≤ C²nL/(4(L − 1)²),

which gives the claim provided L ≥ 2 and n ≥ L. So to conclude, we need only establish (C.15). To this end, write the LHS of (C.15) as

(cos ϕ^(ℓ) − 1) ξ^(ℓ) − (cos ϕ^(ℓ+1) − 1) ξ^(ℓ+1) = [(cos ϕ^(ℓ) − cos ϕ^(ℓ+1)) − (ϕ^(ℓ)/π)(cos ϕ^(ℓ) − 1)] ξ^(ℓ+1)

to notice that it suffices to prove nonnegativity of the bracketed quantity. In addition, since ℓ ≥ 0 and ϕ(ν) ≤ ν by Lemma E.5, we can instead prove the inequality

(cos x − cos ϕ(x)) − (x/π)(cos x − 1) ≥ 0 for all x ∈ [0, π].

Using the closed-form expression for cos ϕ(x) in Lemma E.2, we can plug into the previous inequality and cancel to get the equivalent inequality x − sin x ≥ 0. But this is immediate from the concavity estimate sin x ≤ x, and (C.15) is proved.

Lemma C.12.
If ℓ ∈ ℕ₀, one has the "fluid" estimate for the angle evolution function

ϕ^(ℓ)(ν) ≤ ν/(1 + cℓν),

where c > 0 is an absolute constant. In particular, if ℓ ∈ ℕ one has ϕ^(ℓ) ≤ 1/(cℓ).

Proof. The second claim follows from the first claim and 1 + cℓν ≥ cℓν, so we will focus on establishing the first estimate. The proof is by induction on ℓ ∈ ℕ, since the case ℓ = 0 is immediate. By Lemma E.5, there is a constant c₁ > 0 such that ϕ(ν) ≤ ν(1 − c₁ν), and using the numerical inequality x(1 − x) ≤ x(1 + x)⁻¹, valid for x ≥ 0, we get

ϕ(ν) ≤ ν/(1 + c₁ν), (C.16)

which establishes the claim in the case ℓ = 1. Assuming the claim holds for ℓ − 1, we calculate

ϕ^(ℓ)(ν) ≤ ϕ^(ℓ−1)(ν)/(1 + c₁ ϕ^(ℓ−1)(ν)) ≤ [ν/(1 + c₁(ℓ−1)ν)] / (1 + c₁ ν/(1 + c₁(ℓ−1)ν)),

where the first inequality uses (C.16), and the second inequality uses the induction hypothesis and the relation x(1 + x)⁻¹ = 1 − (1 + x)⁻¹ to see that x ↦ x(1 + c₁x)⁻¹ is increasing. Clearing denominators in the numerator and denominator of the RHS of this last bound, we see that it is equal to ν/(1 + c₁ℓν), and the claim follows by induction.

Lemma C.13. If ℓ ∈ ℕ₀, the iterated angle evolution function satisfies the estimate ϕ^(ℓ)(ν) ≥ ν/(1 + ℓν/π).

Proof. The proof is by induction on ℓ ∈ ℕ, since the case ℓ = 0 is immediate. The case ℓ = 1 follows from Lemma C.14. Assuming the claim holds for ℓ − 1, we calculate

ϕ^(ℓ)(ν) ≥ ϕ^(ℓ−1)(ν)/(1 + ϕ^(ℓ−1)(ν)/π) ≥ [ν/(1 + (ℓ−1)ν/π)] / (1 + (1/π) ν/(1 + (ℓ−1)ν/π)),

where the first inequality applies Lemma C.14, and the second uses the fact that the RHS of the bound in Lemma C.14 is strictly increasing, together with the induction hypothesis. Clearing denominators in the numerator and denominator of the RHS of this last bound, we see that it is equal to ν/(1 + ℓν/π), and the claim follows by induction.

Lemma C.14. It holds that ϕ(ν) ≥ ν/(1 + ν/π).

Proof. After some rearranging using Lemma E.2, it suffices to prove

(1 − ν/π) cos ν + sin ν/π ≤ cos(πν/(π + ν)). (C.17)

Using Lemma E.5, we see that both the LHS and RHS of this bound are nonincreasing.
We will prove the estimate in three stages, using "small angle", "large angle", and "intermediate angle" estimates of the quantities on both sides of (C.17). Since πν/(π + ν) ∈ [0, π/2], we can use standard estimates for cos to get the RHS estimates

cos(πν/(π + ν)) ≥ 1 − (1/2)(πν/(π + ν))² (C.18)

and

cos(πν/(π + ν)) ≥ (π − ν)/(π + ν). (C.19)

As for the LHS, we can obtain an estimate near ν = π in a straightforward way. Transforming the domain by ν ↦ π − ν, it suffices to get estimates on sin ν − ν cos ν near ν = 0, then divide by π. Using cos ν ≥ 1 − (1/2)ν² and sin ν ≤ ν, it follows that sin ν − ν cos ν ≤ (1/2)ν³. We conclude

(1 − ν/π) cos ν + sin ν/π ≤ (1/(2π))(π − ν)³. (C.20)

We will develop a second-order approximation to the LHS near 0 for the small-angle estimates. The first, second, and third derivatives of the LHS are −(1 − ν/π) sin ν, (1/π) sin ν − (1 − ν/π) cos ν, and (2/π) cos ν + (1 − ν/π) sin ν, respectively. To bound the third derivative, we will use the estimate cos ν ≤ 1 − ν²/3 on [0, π/2]. To prove this, note that Taylor's formula implies the bound cos ν ≤ 1 − ν²/3 on [0, cos⁻¹(2/3)]; because cos is concave on [0, π/2], we also have the tangent line bound cos ν ≤ −ν√5/3 + (2/3 + √5 cos⁻¹(2/3)/3) on [0, π/2]. We can then solve for the zeros of the quadratic polynomial 1 − ν²/3 + (√5/3)ν − (2/3 + √5 cos⁻¹(2/3)/3); a numerical evaluation shows that both roots are real and outside the interval [cos⁻¹(2/3), π/2]. Since the tangent line touches the graph of cos at ν = cos⁻¹(2/3), this proves that cos ν ≤ 1 − ν²/3 on [0, π/2]. We can therefore write

2 cos ν + (π − ν) sin ν ≤ 2(1 − ν²/3) + ν(π − ν),  ν ∈ [0, π/2].

The RHS of this inequality is a concave quadratic; we calculate its maximum analytically as 2 + 3π²/20. Meanwhile, if ν ∈ [π/2, π], we have 2 cos ν ≤ 0 and (π − ν) sin ν ≤ π/2. We conclude that 2 cos ν + (π − ν) sin ν ≤ 2 + 3π²/20 on [0, π], so that the third derivative (2/π) cos ν + (1 − ν/π) sin ν is at most (2 + 3π²/20)/π there. Writing c = 1/(3π) + π/40 = (2 + 3π²/20)/(6π), Taylor's formula with this third-derivative bound implies the estimate

(1 − ν/π) cos ν + sin ν/π ≤ 1 − ν²/2 + cν³.
(C.21)

Finally, we will need some estimates for interpolating between the small- and large-angle regimes. We note that the second derivative (1/π) sin ν − (1 − ν/π) cos ν of the LHS of (C.17) is nonnegative if ν ≥ π/2, because cos ≤ 0 there; meanwhile, the third derivative (2/π) cos ν + (1 − ν/π) sin ν of the LHS of (C.17) is nonnegative if 0 ≤ ν ≤ π/2, since cos ≥ 0 there, and it follows that the second derivative is increasing on [0, π/2]. Checking numerically that the value of the second derivative at 1.42 is positive, we conclude that the LHS of (C.17) is convex on [1.42, π]. In addition, we use calculus to evaluate the first and second derivatives of the RHS of (C.18) as −π³ν/(π + ν)³ and −π³(π − 2ν)/(π + ν)⁴, respectively; this shows that the RHS of (C.18) is convex for ν ≥ π/2 and concave for ν ≤ π/2. Taking a tangent line to the graph of the RHS of (C.18) at π/2, it follows that the function

g(ν) = 1 − (π²/2) ν²/(π + ν)² for ν ≤ π/2;  g(ν) = −(4π/27) ν + 1 + π²/54 for ν ≥ π/2 (C.22)

is a concave lower bound for the RHS of (C.18) on [0, π]. We proceed to use the estimates developed in the previous paragraphs to prove (C.17). We first argue that for ν in a neighborhood of 0 we have 1 − ν²/2 + cν³ ≤ 1 − (π²/2) ν²/(π + ν)², which will in turn prove (C.17) in the same neighborhood. Cancelling and rearranging, it is equivalent to show (2/π − 2c) − (4c/π − 1/π²) ν − (2c/π²) ν² ≥ 0. The LHS is a concave quadratic with value 2/π − 2c > 0 at 0; we calculate its two distinct roots numerically as lying in the intervals [−5.1, −5] and [1.42, 1.43], respectively. It follows that (C.17) holds for ν ∈ [0, 1.42]. Next, we argue that for ν in a neighborhood of π we have (1/(2π))(π − ν)³ ≤ (π − ν)/(π + ν), which will in turn prove (C.17) in the same neighborhood. Transforming with ν ↦ π − ν and rearranging, it is equivalent to show ν²(2π − ν) ≤ 2π in a neighborhood of 0.
The LHS of this last inequality is 0 at 0 and nonnegative on [0, π]; its first and second derivatives are ν(4π − 3ν) and 4π − 6ν, respectively, which shows that it is a strictly increasing function of ν on [0, π]. Locating numerically the smallest positive root of the resulting cubic (it exceeds 1.1) and transferring the result back via another transformation ν ↦ π − ν, we conclude that (C.17) holds on [π − 1.1, π]. To obtain that (C.17) holds on [1.42, π − 1.1], we use that the function g defined in (C.22) is a concave lower bound for the RHS of (C.18), so that it suffices to show that the LHS of (C.17) is upper bounded by g on [1.42, π − 1.1]. The LHS of (C.17) is convex on [1.42, π], so it is sufficient to show that the values of the LHS of (C.17) at 1.42 and at π − 1.1 are upper bounded by those of g at the same points. Confirming this numerically, we can conclude the proof.

Lemma C.15. If ℓ ∈ ℕ₀, one has

ϕ̇^(ℓ)(ν) ≤ 1/(1 + (c/2) ℓ ν),

where c > 0 is the absolute constant also appearing in Lemma E.5 (property 4), and in particular c/2 is equal to the absolute constant appearing in Lemma C.12. In particular, if ℓ ∈ ℕ and ν ∈ [0, π] we have the estimate ν ϕ̇^(ℓ)(ν) ≤ 2/(cℓ).

Proof. The case ℓ = 0 follows directly (as an equality) from ϕ^(0)(ν) = ν. Now we assume ℓ ∈ ℕ. Smoothness of ϕ^(ℓ) follows from Lemma E.5. Applying the chain rule and an induction, we have

ϕ̇^(ℓ) = (ϕ̇ ∘ ϕ^(ℓ−1)) ϕ̇^(ℓ−1) = Π_{ℓ′=0}^{ℓ−1} ϕ̇ ∘ ϕ^(ℓ′), (C.23)

and applying the chain rule also gives

ϕ̈^(ℓ) = (ϕ̇^(ℓ−1))² (ϕ̈ ∘ ϕ^(ℓ−1)) + ϕ̈^(ℓ−1) (ϕ̇ ∘ ϕ^(ℓ−1)). (C.24)

By Lemma E.5, we have ϕ̇ > 0 on [0, π), and the formula (C.23) then implies that ϕ̇^(ℓ) > 0 on [0, π) as well. Considering only angles in this half-open interval and dividing, it follows that

ϕ̈^(ℓ)/(ϕ̇^(ℓ))² = (ϕ̈ ∘ ϕ^(ℓ−1))/(ϕ̇ ∘ ϕ^(ℓ−1))² + (1/(ϕ̇ ∘ ϕ^(ℓ−1))) ϕ̈^(ℓ−1)/(ϕ̇^(ℓ−1))² = (ϕ̈/ϕ̇²) ∘ ϕ^(ℓ−1) + (1/(ϕ̇ ∘ ϕ^(ℓ−1))) ϕ̈^(ℓ−1)/(ϕ̇^(ℓ−1))².

Applying an induction using the previous formula and distributing in the result, we obtain

ϕ̈^(ℓ)/(ϕ̇^(ℓ))² = Σ_{ℓ′=0}^{ℓ−1} (Π_{ℓ″=ℓ′+1}^{ℓ−1} 1/(ϕ̇ ∘ ϕ^(ℓ″))) (ϕ̈/ϕ̇²) ∘ ϕ^(ℓ′).
(C.25) By Lemma E.5, we have 0 < φ ≤ 1 on [0, π) and φ ≤ 0. Thus - φ( ) φ( ) 2 ≥ - -1 =0 φ • ϕ ( ) . When > 0, we have ϕ ( ) ≤ π/2, and by Lemma E.5, we have φ ≤ -c < 0 on [0, π/2]; thus, -φ • ϕ ( ) ≥ c if > 0. When = 0, we can use the fact that φ ≤ 0 on [0, π] to get a bound φ ≤ -c1 [0,π/2] . We conclude - φ( ) φ( ) 2 ≥ c( -1) + c1 [0,π/2] . (C.26) Next, we notice using the chain rule that 1 φ( ) = - φ( ) φ( ) 2 , and using (C.23) and Lemma E.5, we have that φ( ) (0) = 1. For any ν ∈ [0, π), we integrate both sides of (C.26) from 0 to ν to obtain using the fundamental theorem of calculus 1 φ( ) (ν) -1 ≥ c( -1)ν + c ν 0 1 [0,π/2] (t) dt = c( -1)ν + c min{ν, π/2} ≥ c ν 2 , where in the final inequality we use the inequality min{ν, π/2} ≥ ν/2, valid for ν ∈ [0, π]. Rearranging, we conclude for any 0 ≤ ν < π φ( ) (ν) ≤ 1 1 + (c/2) ν , and noting that the LHS of this bound is equal to 0 at ν = π and the RHS is positive, we conclude the claimed bound for every ν ∈ [0, π]. The second estimate claimed follows by multiplying this bound by ν on both sides, and using 1 + (c/2) ν ≥ (c/2) ν. Lemma C.16. If ∈ N, one has φ( ) (ν) ≤ C 1 + (c/8) ν 1 + 1 (c/8)ν log (1 + (c/8)( -1)ν) , where C > 0 is an absolute constant, and c > 0 is the absolute constant also appearing in Lemma E.5 (property 4), and in particular c/2 is equal to the absolute constant appearing in Lemma C.12. If ν ∈ [0, π], the RHS of this upper bound is a decreasing function of ν, and moreover we have the estimates | φ( ) | ≤ C , ν 2 φ( ) (ν) ≤ Cπν 1 + (c/8) ν 1 + 8 log cπ ≤ 8πC c + 64C log c 2 . Proof. Smoothness follows from Lemma E.5; we make use of some results from the proof of Lemma C.15, in particular (C.23) and (C.25). We treat the case of = 1 first. 
By Lemma E.5, we have | φ| ≤ C for an absolute constant C > 0, and since 1/(1 + (c/2)ν) ≥ 1/(3/2) = 2/3 by the numerical estimate of the absolute constant c > 0 in Lemma E.5, it follows | φ(ν)| ≤ 3C/2 1 + (c/2)ν , which establishes the claim when = 1 (after worst-casing constants if necessary). Next, we assume > 1. Multiplying both sides of (C.25) by ( φ( ) ) 2 and cancelling using (C.23), we obtain φ( ) = -1 =0 -1 =0 φ • ϕ ( ) 2 -1 = +1 φ • ϕ ( ) φ • ϕ ( ) φ • ϕ ( ) 2 = -1 =0   -1 =0 φ • ϕ ( ) 2   -1 = +1 φ • ϕ ( ) φ • ϕ ( ) = φ( ) -1 =0 φ( ) φ • ϕ ( ) φ • ϕ ( ) , (C.27) where the last equality holds at least on [0, π), by Lemmas E.5 and C.15, and where empty products are defined to be 1. If > 0, we have ϕ ( ) ≤ π/2, and by Lemma E.5 we have that | φ| ≤ C and φ ≥ c > 0 on [0, π/2] for absolute constants C, C > 0. Separating the = 0 summand, this gives a bound φ( ) ≤ C -1 =1 φ • ϕ ( ) + C c φ( ) -1 =1 φ( ) . (C.28) By Lemma C.15, we have φ(ν) ≤ 1/(1 + (c/2)ν), and by Lemma E.5, we have ϕ(ν) ≤ ν, hence ϕ ( ) (ν) ≤ ν. Using concavity of ϕ, nonincreasingness of φ and nondecreasingness of ϕ ( ) (which follow from Lemma E.5) and a simple re-indexing, we can write -1 =1 φ • ϕ ( ) (ν) = -2 =0 φ • ϕ ( +1) (ν) = -2 =0 φ • ϕ ( ) • ϕ(ν) ≤ -2 =0 φ(ϕ ( ) (ν/2)) = φ( -1) (ν/2) ≤ 1 1 + (c/4)( -1)ν ≤ 1 1 + (c/8) ν where the third-to-last line follows from (C.23), the second-to-last line follows from Lemma C.15, and the last line follows from the inequality -1 ≥ /2 if ≥ 2. Following on from (C.28), we conclude by an application of Lemma C.15 φ( ) (ν) ≤ C 1 + (c/8) ν + C/c 1 + (c/2) ν -1 =1 1 1 + (c/2) ν ≤ C c 1 1 + (c/8) ν -1 =0 1 1 + (c/8) ν , where the last line simply worst-cases the constants. 
For any ∈ N 0 , the function x → 1/(1 + (c/8) x) is nonincreasing, so we can estimate the sum in the previous statement using an integral, obtaining φ( ) (ν) ≤ C/c 1 + (c/8) ν 1 + -1 0 1 1 + cνx dx ≤ C/c 1 + (c/8) ν 1 + 1 (c/8)ν log (1 + (c/8)( -1)ν) after evaluating the integral-we define the quantity inside the parentheses on the RHS of the final inequality to be -1 when ν = 0, which agrees with the integral representation in the previous line and with the unique continuous extension of the function on (0, π] to [0, π]-which establishes the first claim. We now move on to the study of the bound we have derived. For decreasingness, we note that the functions ν → C/c 1 + (c/8) ν , ν → 1 + 1 (c/8)ν log (1 + (c/8)( -1)ν) , (C.29) whose product is equal to our upper bound, are evidently both smooth nonnegative functions of ν at least on (0, π], so that by the product rule for differentiable functions it suffices to prove that these two functions are themselves decreasing functions of ν. The first function is evidently decreasing as an increasing affine reparameterization of ν → 1/ν; for the second function, after multiplying by the constant -1 and rescaling by a positive number (when = 1, the function is identically zero on (0, π], and the function's continuous extension as defined above equals 0 at 0 as well), we observe that it suffices to prove that x → x -1 log(1+x) is a decreasing function of ν on (0, ∞). The derivative of this function is x → (x -(1 + x) log(1 + x))/(x 2 (1 + x)), so it suffices to show that x -(1 + x) log(1 + x) ≤ 0. 
Noting that the function x → x log x is convex (its second derivative is 1/x), it follows that x -(1 + x) log(1 + x) is concave as a sum of concave functions, and is therefore has its graph majorized by its supporting hyperplanes; its derivative is equal to -log(1 + x), which equals 0 at 0, and we therefore conclude from our previous reduction that the second function in (C.29) is decreasing, and that our composite upper bound is as well. For the remaining estimates, we use the concavity estimate log(1 + x) ≤ x to obtain from our previous result φ( ) (ν) ≤ C 1 + (c/8) ν ≤ C , since the function x → C/(1 + cx) is nonincreasing for any choice of the constants. Next, we use the expression we have derived in the first claim to obtain ν 2 φ( ) (ν) ≤ Cν 1 + (c/8) ν ν + 1 (c/8) log (1 + (c/8)( -1)ν) . For any K > 0, the function x → x/(1 + Kx) is nondecreasing, and using the numerical estimate π(c/8) < 1 that follows from Lemma E.5, we obtain in addition 1 + π(c/8)( -1) ≤ for ∈ N. Thus ν 2 φ( ) (ν) ≤ Cπ 2 1 + (c/8) π 1 + log cπ/8 ≤ 8πC c + 64C log c 2 , as claimed. Lemma C.17. One has for every ∈ {0, 1, . . . , L} ϕ ( ) (0) = 0; φ( ) (0) = 1; φ( ) (0) = - 2 3π , and for every ∈ [L] φ( ) (π) = φ( ) (π) = 0. Finally, we have φ(0) (π) = 1 and φ(0) (π) = 0. Proof. The claims are consequences of Lemma E.5 when = 1, and of ϕ (0) = Id for smaller ; assume > 1 below. The claim for ϕ ( ) (0) follows from the fact that ϕ(0) = 0 and induction. For the claim about φ( ) (0), we calculate using the chain rule φ( ) (0) = φ(ϕ ( -1) (0)) φ( -1) (0) = φ(0) φ( -1) (0) = φ( -1) (0). By induction and Lemma E.5, we obtain φ( ) (0) = 1. The claim about φ( ) (π) follows from the same argument. For the remaining claims about φ( ) , we calculate using the chain rule φ( ) = φ( -1) 2 φ • ϕ ( -1) + φ( -1) φ • ϕ ( -1) , whence φ( ) (0) = φ(0) + φ( -1) (0). Using Lemma E.5 to get φ(0) = -2/(3π), this yields φ( ) (0) = - 2 3π . Similarly, since we have shown φ( -1) (π) = 0, we obtain φ( ) (π) = 0. 
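The boundary values just established can be spot-checked by finite differences, using the closed form ϕ(ν) = cos⁻¹((1 − ν/π) cos ν + sin ν/π) from Lemma E.2. The sketch below is an illustration only (the step size and the choice ℓ = 5 are arbitrary), not part of the proof:

```python
import numpy as np

def phi(nu):
    # closed form for the angle evolution function (Lemma E.2)
    arg = (1 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return np.arccos(np.clip(arg, -1.0, 1.0))

def phi_iter(nu, ell):
    # ell-fold composition phi^(ell)
    for _ in range(ell):
        nu = phi(nu)
    return nu

h, ell = 1e-4, 5
# phi^(ell)(0) = 0 and its one-sided first derivative at 0 should be 1
val0 = phi_iter(0.0, ell)
d1_0 = (phi_iter(h, ell) - val0) / h
# one-sided second difference at 0; Lemma C.17 predicts -2*ell/(3*pi)
d2_0 = (phi_iter(2 * h, ell) - 2 * phi_iter(h, ell) + val0) / h**2
# backward first difference at pi; Lemma C.17 predicts 0 for ell >= 1
d1_pi = (phi_iter(np.pi, ell) - phi_iter(np.pi - h, ell)) / h
```

The second difference tracks ϕ̈^(ℓ)(0) = −2ℓ/(3π) up to O(h) truncation error, which is ample at this step size.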
Lemma C.18. For first and second derivatives of ξ ( ) , one has ξ( ) = -π -1 L-1 = φ( ) L-1 = = 1 -π -1 ϕ ( ) , and ξ( ) (C.30) = -π -1 L-1 =     φ( ) L-1 = = 1 -π -1 ϕ ( ) -π -1 φ( ) L-1 = = φ( ) L-1 = = , = 1 -π -1 ϕ ( )     , (C.31) where empty sums are interpreted as zero, and empty products as 1. In particular, one calculates ξ ( ) (0) = 1; ξ( ) (0) = - L - π ; ξ( ) (0) = (L -)(L --1) π 2 + L(L -1) -( -1) 3π 2 , and ξ (0) (π) = 0; ξ( ) (π) = - 1 π ξ (1) (π)1 =0 ; ξ( ) (π) = 0. Proof. The two derivative formulas are direct applications of the Leibniz rule to ξ ( ) . The claims about values at 0 follow from plugging the results of Lemma C.17 into our derivative formulas and the definition of ξ ( ) . For values at π, we first note that ϕ (0) (π) = π, from which it follows ξ (0) (π) = 0. Next, we use Lemma C.17 to get that φ( ) (π) = 0 for all ∈ {0, 1, . . . , L} and φ( ) (π) = 1 =0 to get ξ( ) (π) = -π -1 ξ (1) (π)1 =0 . For ξ( ) (π), we have ξ( ) = π -2 L-1 = φ( ) (π) = φ( ) (π) = , = 1 -π -1 ϕ ( ) (π) = π -2 1 =0 L-1 =1 φ( ) (π) = , =0 1 -π -1 ϕ ( ) (π) . If L = 1, the sum in the last expression is empty, and this quantity is 0. If L > 1, the sum is nonempty, and every summand is equal to zero by Lemma C.17. We conclude ξ( ) (π) = 0. Lemma C.19. If L ≥ 3, there exists an absolute constant 0 < C ≤ π/2 such that on the interval [0, C], one has for every = 0, 1, . . . , L -1 ... ξ ( ) ≤ 0. Proof. We consider functions only on [0, π/2] in this proof. Following the calculations in the proof of Lemma C.15, we have the expression ϕ • ϕ ( -1) (C.32) = φ • ϕ ( -1) ... ϕ ( -1) + 3 φ • ϕ ( -1) φ( -1) φ( -1) + ... ϕ • ϕ ( -1) φ( -1) 3 . (C.33) Using as well Lemma E.5, we have first and second derivative estimates 0 ≤ φ( ) ≤ 1 and -C 2 ≤ φ( ) ≤ -c 2 . By Lemma E.5, ... ϕ extends to a continuous function on [0, π/2], so in addition there exists a δ > 0 such that on [0, δ] we have . .. ϕ ≥ - 1 2π 2 (C.34) We lower bound (C.33) on [0, δ] using these estimates. 
For = 1, we can do no better than (C.34). For > 1, we can write ϕ • ϕ ( -1) ≥ φ • ϕ ( -1) ... ϕ ( -1) + 3 φ( -1) φ • ϕ ( -1) φ( -1) - 1 6π 2 φ( -1) 2 ≥ φ • ϕ ( -1) ... ϕ ( -1) + 3 φ( -1) c 2 2 ( -1) - 1 6π 2 . We have the numerical estimate c 2 = 0.14 from Lemma E.5, and we check numerically that (0.14) 2 > 1/6π 2 . This implies that on [0, δ] and for every ≥ 2, ... ϕ ( ) is lower bounded by a positive number plus a scaled version of ... ϕ ( -1) . We check precisely using the original formula (C.33) and Lemma E.5 for = 2 ... ϕ (2) (0) = 2 ... ϕ (0) + 3 φ(0) 2 = 2 3π 2 > 0, so that in particular ... ϕ (2) (0) + ... ϕ (1) (0) = 1 3π 2 > 0. By continuity, it follows that there is a neighborhood [0, δ ] on which we have ... ϕ (2) + ... ϕ (1) > 0. Thus, on [0, min{δ, δ }], we guarantee that simultaneously ... ϕ ( ) > 0 if ≥ 2; ... ϕ (2) + ... ϕ (1) > 0. Now we consider the third derivative of the skeleton summands ξ ( ) . Following the calculations of Lemmas C.10 and C.18, in particular applying the Leibniz rule, we observe that every term in the sum defining ... ξ ( ) that does not involve a third derivative of one of the factors (1 -(1/π)ϕ ( ) ) will be nonpositive, because (1 -(1/π)ϕ ( ) ) ≥ 0, φ( ) ≥ 0, and φ( ) ≤ 0. Meanwhile, by our calculations above, on the interval [0, min{δ, δ }], the only terms that can be positive are those with = 0 or = 1 where we differentiate the = 1 factor three times, i.e., the = 1 term in the sum - 1 π L-1 = ... ϕ ( ) L-1 = = 1 - ϕ ( ) π with = 0 or = 1. We will compare the = 1 summand with the = 2 summand: we have that the sum of these two terms equals - 1 π     L-1 = =1,2 1 - ϕ ( ) π     ... ϕ 1 - ϕ (2) π + ... ϕ (2) 1 - ϕ π . (C.35) At 0, the quantity inside the right parentheses is equal to ... ϕ + ... ϕ (2) > 0, by our calculations above. Thus, by continuity, there is a possibly smaller δ > 0 such that on [0, δ ], the sum of terms (C.35) is negative. We conclude that on [0, min{δ, δ , δ }], we have for every ≥ 0 ... 
ξ ( ) ≤ 0, and since we have chosen the neighborhood sizes δ, δ , δ independently of the depth L, we can conclude. Lemma C.20. For all ∈ {0, . . . , L -1}, one has ξ ( ) (ν) ≤ 1 + ν/π 1 + Lν/π . Proof. We have ξ ( ) (ν) = L-1 = 1 - ϕ ( ) (ν) π ≤ 1 - 1 π(L -) L-1 = ϕ ( ) (ν) L- ≤ exp - 1 π L-1 = ϕ ( ) (ν) , (C.36) where the first inequality applies the AM-GM inequality, and the second uses the standard exponential convexity estimate. Using Lemma C.13, we have - L-1 = ϕ ( ) (ν) ≤ - L-1 = ν 1 + ν/π ≤ - L ν 1 + ν/π d , where the last inequality uses the fact that → ν/(1 + ν/π) is nonincreasing for every ν ∈ [0, π] together with a standard estimate from the integral test. We calculate L ν 1 + ν/π d = π log 1 + Lν/π 1 + ν/π , which gives the claim after substituting into (C.36). Lemma C.21. For all ∈ {0, . . . , L -1}, one has | ξ( ) (ν)| ≤ 3 L - 1 + Lν/π . Proof. Using Lemma C.18, we have ξ( ) = - ξ (1) π 1 =0 - ξ ( ) π L-1 =max{ ,1} φ( ) 1 -ϕ ( ) /π where we directly treat the case = 0 to avoid dividing by zero at ν = π. The triangle inequality and Lemmas E.5 and C.20 then give | ξ( ) (ν)| ≤ 2 π   1 1 + Lν/π 1 =0 + ξ ( ) (ν) L-1 =max{ ,1} φ( ) (ν)   . Using Lemma C.15, we have L-1 = φ( ) (ν) ≤ L-1 = 1 1 + c ν ≤ 1 1 + c ν + L-1 1 1 + c ν d , where the last inequality uses the fact that → 1/(1 + c ν) is nonincreasing for every ν ∈ [0, π] together with a standard estimate from the integral test. Evaluating the integral, we obtain L-1 = φ( ) (ν) ≤ 1 1 + c ν + 1 cν log 1 + c(L -1)ν 1 + c ν , where the second term on the RHS is defined at ν = 0 by continuity. Using the standard concavity estimate log(1 + x) ≤ x, we have 1 cν log 1 + c(L -1)ν 1 + c ν = 1 cν log 1 + (L --1)cν 1 + c ν ≤ L --1 1 + c ν , whence L-1 = φ( ) (ν) ≤ L - 1 + c ν . (C.37) Combined with the result of Lemma C.20, we conclude ξ ( ) (ν) L-1 = φ( ) (ν) ≤ 1 cπ L - 1 + Lν/π . The numerical estimate c = 0.07 in Lemma E.5 then allows us to conclude | ξ( ) (ν)| ≤ 3 L - 1 + Lν/π , as claimed. Lemma C.22. 
One has |ψ 1 (ν)| ≤ 5nL 2 1 + Lν/π , |ψ (ν)| ≤ (3/2)nL 2 1 + Lν/π . Proof. We calculate using the chain rule ψ 1 = n 2 L-1 =0 ξ( ) cos ϕ ( ) -ξ ( ) φ( ) sin ϕ ( ) , and the triangle inequality gives |ψ 1 | ≤ n 2 L-1 =0 ξ( ) + ξ ( ) φ( ) . Applying Lemmas C.15, C.20 and C.21 and Lemma E.5 to estimate the constant c in Lemma C.15, we then obtain |ψ 1 (ν)| ≤ n 2(1 + Lν/π) L-1 =0 3(L -) + 1 + ν/π 1 + ν/(5π) ≤ n 2(1 + Lν/π) L-1 =0 3(L -) + 1 + 4 ν/(5π) 1 + ν/(5π) ≤ n 2(1 + Lν/π) 3L 2 2 + 5L ≤ 5nL 2 1 + Lν/π . The proof of the second claim is nearly identical, since in this case we need only use the bounds on | ξ( ) |. Lemma C.23. There are absolute constants c, C > 0 such that for all ∈ {0, . . . , L -1}, one has ξ( ) ≤ C L(L -)(1 + ν/π) (1 + cLν) 2 + C (L -) 2 (1 + cLν)(1 + c ν) . Proof. By Lemmas E.5 and C.18, we can write ξ( ) = - ξ ( ) π L-1 =max{1, } φ( ) 1 -ϕ ( ) π + 1 π 2     2ξ (1) 1 =0 L-1 =1 φ( ) 1 -ϕ ( ) π + ξ ( ) L-1 =max{1, } L-1 =max{1, } = φ( ) φ( ) 1 -ϕ ( ) π 1 -ϕ ( ) π     Focusing first on the second term, we have using Lemma E.5, (C.37) and Lemma C.20 2ξ (1) 1 =0 L-1 =1 φ( ) 1 -ϕ ( ) π + ξ ( ) L-1 =max{1, } L-1 =max{1, } = φ( ) φ( ) 1 -ϕ ( ) π 1 -ϕ ( ) π ≤4ξ (1) 1 =0 L-1 =1 φ( ) + 4ξ ( ) L-1 =max{1, } L-1 =max{1, } = φ( ) φ( ) . We can then write using nonnegativity L-1 = L-1 = = φ( ) φ( ) ≤ L-1 = L-1 = φ( ) φ( ) = L-1 = φ( ) 2 , and using (C.37) and Lemma C.20, we obtain thus ξ (1) 1 =0 L-1 =1 φ( ) + ξ ( ) L-1 =max{1, } L-1 =max{1, } = φ( ) φ( ) ≤ 3 cπ (L -) 2 (1 + Lν/π)(1 + c ν) . Regarding the first term, we have using Lemma C.16 L-1 = | φ( ) | ≤ C L-1 = 1 + (c/4) ν ≤ C L(L -) 1 + (c/4)Lν , because the function → /(1 + c ν) is nondecreasing. Applying also Lemma C.20, we obtain using the triangle inequality and worst-casing constants ξ( ) ≤ C 1 L(L -)(1 + ν/π) (1 + cLν) 2 + C 2 (L -) 2 (1 + cLν)(1 + c ν) . Lemma C.24. One has |ψ 1 (ν)| ≤ CnL 3 1 + cLν , |ψ (ν)| ≤ CnL 3 1 + cLν , where c, C > 0 are absolute constants. Proof. 
We calculate using the chain rule ψ 1 = n 2 L-1 =0 ξ( ) cos ϕ ( ) -2 ξ( ) φ( ) sin ϕ ( ) -ξ ( ) φ( ) sin ϕ ( ) -ξ ( ) φ( ) 2 cos ϕ ( ) , and the triangle inequality gives |ψ 1 | ≤ n 2 L-1 =0 ξ( ) + 2 ξ( ) φ( ) + ξ ( ) φ( ) + ξ ( ) φ( ) 2 . Using Lemmas C.15, C.16, C.20, C.21 and C.23 and worst-casing constants for convenience, we obtain from the last estimate |ψ 1 (ν)| ≤ Cn L-1 =0   L(L-)(1+ ν/π) (1+cLν) 2 + (L-) 2 (1+cLν)(1+c ν) + L- (1+Lν/π)(1+c ν) + 1+ ν/π 1+Lν/π 1 (1+c ν) 2 +   ≤ CnL 3 1 + cLν , where in the second line we made some estimates along the lines of the proof of Lemma C.22 and worsened the constant C. The proof for ψ follows from the same argument, since in this case we have the same sum of ξ( ) terms but none of the extra residuals.
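Two of the main skeleton estimates of this subsection lend themselves to a quick numerical spot check from the closed form of ϕ: the lower fluid bound of Lemma C.13 and the ξ^(ℓ) bound of Lemma C.20. A short grid check (illustrative only; the depth L and the grid are arbitrary choices of ours):

```python
import numpy as np

def phi(nu):
    # angle evolution function (Lemma E.2)
    arg = (1 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return np.arccos(np.clip(arg, -1.0, 1.0))

L = 8
nu = np.linspace(1e-6, np.pi, 400)
iters = [nu.copy()]                 # phi^(0), ..., phi^(L-1) on the grid
for _ in range(L - 1):
    iters.append(phi(iters[-1]))

# Lemma C.13: phi^(l)(nu) >= nu / (1 + l*nu/pi)
lower_ok = all(np.all(iters[l] >= nu / (1 + l * nu / np.pi) - 1e-9)
               for l in range(L))

# Lemma C.20: xi^(l)(nu) = prod_{l'=l}^{L-1} (1 - phi^(l')(nu)/pi)
#             <= (1 + l*nu/pi) / (1 + L*nu/pi)
factors = np.array([1 - it / np.pi for it in iters])
upper_ok = True
for l in range(L):
    xi = factors[l:].prod(axis=0)
    upper_ok &= np.all(xi <= (1 + l * nu / np.pi) / (1 + L * nu / np.pi) + 1e-9)
```

Both bounds hold with visible slack away from the endpoints, and become equalities in the limits ν → 0 (for C.20) and ℓ = 0 (for C.13), as the proofs suggest.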

D CONCENTRATION AT INITIALIZATION D.1 NOTATION AND FRAMEWORK

We recall the expression for the neural tangent kernel, as summarized in Appendix A.5.2:

Θ(x, x′) = ⟨∇f_{θ₀}(x), ∇f_{θ₀}(x′)⟩ = ⟨α^L(x), α^L(x′)⟩ + Σ_{ℓ=0}^{L−1} ⟨α^ℓ(x), α^ℓ(x′)⟩ ⟨β^ℓ(x), β^ℓ(x′)⟩.

The objective of this section is to establish supporting results for the proof of Theorem B.2, which gives uniform concentration of Θ(x, x′) over M × M around the deterministic skeleton kernel. We take a pointwise-then-uniformize approach to proving this result: Appendix D.2 establishes concentration results for the constituents of Θ(x, x′) when x, x′ are fixed, and Appendix D.3 develops results that control the number of local support changes near points in a discretization of M × M, in order to provide a suitable stand-in for the continuity properties necessary to uniformize these pointwise results. We collect relevant technical results and their proofs in Appendix D.4.
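For a finite-width network f_θ(x) = wᵀα^L(x) with α^ℓ the forward ReLU features, the displayed decomposition is the layerwise expansion of the gradient inner product, with β^ℓ the backpropagated vector at layer ℓ. The sketch below is our own illustration of this identity, assuming i.i.d. N(0, 2/fan-in) weights; the function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n0, n, L):
    # He-style N(0, 2/fan_in) weights; L hidden layers plus a scalar output w.
    Ws = [rng.normal(0.0, np.sqrt(2.0 / n0), (n, n0))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / n), (n, n)) for _ in range(L - 1)]
    w = rng.normal(0.0, np.sqrt(2.0 / n), n)
    return Ws, w

def features(Ws, w, x):
    """Forward features alpha^0..alpha^L and backward features beta^0..beta^{L-1},
    where beta^l is the gradient of f with respect to the (l+1)-st preactivation."""
    alphas, pres = [x], []
    for W in Ws:
        z = W @ alphas[-1]
        pres.append(z)
        alphas.append(np.maximum(z, 0.0))
    betas = [None] * len(Ws)
    delta = (pres[-1] > 0) * w
    betas[-1] = delta
    for l in range(len(Ws) - 2, -1, -1):
        delta = (pres[l] > 0) * (Ws[l + 1].T @ delta)
        betas[l] = delta
    return alphas, betas

def ntk(Ws, w, x, xp):
    """<grad f(x), grad f(x')> via the forward/backward feature decomposition."""
    a, b = features(Ws, w, x)
    ap, bp = features(Ws, w, xp)
    val = a[-1] @ ap[-1]                      # output-layer term <alpha^L, alpha^L'>
    for l in range(len(Ws)):
        val += (a[l] @ ap[l]) * (b[l] @ bp[l])
    return val
```

Because β^ℓ is the gradient with respect to the preactivation, each product ⟨α^ℓ, α^ℓ'⟩⟨β^ℓ, β^ℓ'⟩ equals the contribution of one weight matrix to the gradient inner product, so the sum reproduces Θ exactly for this architecture.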

D.2 POINTWISE CONCENTRATION

We fix (x, x′) in this section, and generally suppress notation involving the specific points for concision. We separate our analysis into two distinct sub-problems: "forward concentration", which consists of the study of the forward feature correlations ⟨α^ℓ(x), α^ℓ(x′)⟩, and "backward concentration", which consists of the study of the backward feature correlations ⟨β^ℓ(x), β^ℓ(x′)⟩. Forward concentration is a prerequisite for our approach to backward concentration, so we begin there.

D.2.1 FORWARD CONCENTRATION

Notation. For = 0, 1, . . . , L, define random variables z 1 = α (x) 2 and z 2 = α (x ) 2 . With the convention 0 • +∞ = 0, we define for = 0, . . . , L, random variables ν by ν = cos -1 1 z 1 >0 1 z 2 >0 α (x) α (x) 2 , α (x ) α (x ) 2 -1 {z 1 =0}∪{z 2 =0} . These definitions guarantee that ν = π whenever either feature norm z i vanishes. These random variables are significant toward controlling Θ(x, x ) because, for each α (x), α (x ) = z 1 z 2 cos ν . Let us define pairs of gaussian vectors g 1 , g 2 ∼ i.i.d. N (0, (2/n)I) that are independent of everything else in the problem. For ≥ 1, we have by rotational invariance of the Gaussian distribution and the probability chain rule z 1 = W α -1 (x) + 2 d = g 1 + 2 z -1 1 . Since α 0 (x) = x and x 2 = 1, we have by an induction with analogous definitions z 1 d = =1 g 1 + 2 . Similarly, we have z 2 d = =1 g 1 + 2 . As for the angles, we have by rotational invariance z 1 z 2 = W α -1 (x) + 2 W α -1 (x ) + 2 d = g 1 + 2 g 1 cos ν -1 + g 2 sin ν -1 + 2 z -1 1 z -1 2 , so that an inductive argument gives z 1 z 2 d = =1 g 1 + 2 =1 g 1 cos ν -1 + g 2 sin ν -1 + 2 . We will write z 1 = =1 g 1 + 2 , z 2 = =1 g 1 cos ν -1 + g 2 sin ν -1 + 2 , and similarly ν = cos -1   1 z 1 z 2 >0 g 1 + g 1 + 2 , g 1 cos ν -1 + g 2 sin ν -1 + g 1 cos ν -1 + g 2 sin ν -1 + 2 -1 {z 1 =0}∪{z 2 =0}   , so that we obtain for the angles by a similar inductive argument ν d = ν . (D.1) For technical reasons, it will be convenient to consider an auxiliary angle process, defined for ≥ 1 as ν = cos -1   1 Ē (g 1 , g 2 ) g 1 + g 1 + 2 , g 1 cos ν -1 + g 2 sin ν -1 + g 1 cos ν -1 + g 2 sin ν -1 + 2   , (D.2) where we define with notation from Appendix E.1 Ē = i∈[n] (g 1 , g 2 ) ∀ν ∈ [0, 2π], 1 2 ≤ I [n]\{i} [g 1 cos ν + g 2 sin ν] + 2 ≤ 2 , and ν0 = ν 0 = x, x . 
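The Markov chain ν̄₀ → ν̄₁ → ⋯ → ν̄_L and its mean evolution under the angle evolution function can be visualized directly. The following Monte Carlo sketch is our own (it embeds a two-dimensional pair of unit inputs in ℝⁿ and uses N(0, 2/n) weights) and compares the empirical layerwise feature angles with the deterministic iterates ϕ^(ℓ)(ν₀):

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(nu):
    # angle evolution function (Lemma E.2)
    arg = (1 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi
    return np.arccos(np.clip(arg, -1.0, 1.0))

def simulate_angles(nu0, n, L, trials):
    """Empirical mean of the angle between forward features of two unit inputs
    with initial angle nu0, after each of L random ReLU layers."""
    out = np.zeros((trials, L))
    for t in range(trials):
        a = np.zeros(n); a[0] = 1.0
        b = np.zeros(n); b[0], b[1] = np.cos(nu0), np.sin(nu0)
        for l in range(L):
            W = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            a, b = np.maximum(W @ a, 0.0), np.maximum(W @ b, 0.0)
            c = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
            out[t, l] = np.arccos(np.clip(c, -1.0, 1.0))
    return out.mean(axis=0)

nu0, n, L = 2.0, 300, 4
empirical = simulate_angles(nu0, n, L, trials=200)
predicted = []
nu = nu0
for _ in range(L):
    nu = phi(nu)
    predicted.append(nu)
```

With n in the hundreds, the empirical means already track the iterates to within a few hundredths of a radian, illustrating the layerwise concentration that the martingale argument below quantifies.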
We then observe =1 1 Ē g 1 , g 2 ≤ =1 1 z 1 z 2 >0 , since the inductive structure of z i implies that all feature norms are nonvanishing if and only if the top-level feature norms zL i are nonvanishing, and since the statement =1 1 Ē g 1 , g 2 = 1 implies by definition that zL 1 ≥ 2 -L and zL 2 ≥ 2 -L . By Lemma E.16, as long as n ≥ 21 the event Ē has overwhelming probability, and in particular a union bound implies P =1 1 z 1 z 2 >0 = 1 ≥ P =1 1 Ē g 1 , g 2 = 1 ≥ 1 -CLe -cn , (D.3) so that P ∀ = 1, 2, . . . , L, ν = ν ≥ 1 -CLe -cn . (D.4) We can therefore pass from ν to ν with negligible error. From the expression for ν , we see that the angles ν0 → ν1 → • • • → νL form a Markov chain, and we will control them using martingale techniques. For = 0, 1, . . . , L, we write F to denote the σ-algebra generated by the gaussian vectors (g 1 1 , g 1 2 , g 2 1 , g 2 2 , . . . , g 1 , g 2 ), so that (F 0 , . . . , F L ) is a filtration, and the sequences of random variables (ν 1 , . . . , νL ) and (ν 1 , . . . , νL ) are adapted to (F 1 , . . . , F L ). Moreover, with these definitions we have E ν F -1 = φ(ν -1 ), where φ is the angle evolution function defined in Appendix E.1, which is well-approximated by the function ϕ(ν) = cos -1 1 - ν π + sin ν π (see Lemmas E.1 and E.2). In the sequel, we will employ the notation ϕ ( ) to denote the -fold composition of ϕ with itself. By Lemma E.5, the function ϕ is smooth, and the chain rule implies the same for ϕ ( ) ; we will employ the notation φ( ) and φ( ) for the first and second derivatives of ϕ ( ) , respectively. Main results. Lemma D.1. There are absolute constants c, C, C > 0 and absolute constants K, K > 0 such that for any d ≥ K, if n ≥ K max{1, d 4 log 4 n, d 3 L log 3 n} then one has for any = 1, . . . , L P   α (x), α (x ) -cos ϕ ( ) (∠(x, x )) > C d 3 log 3 n n   ≤ C n -cd . Proof. 
We have α (x), α (x ) d = z 1 z 2 cos ν , and the triangle inequality (applied twice) then yields α (x), α (x ) -cos ϕ ( ) (ν 0 ) ≤ cos ν z 1 z 2 -1 + cos ν -cos ϕ ( ) (ν 0 ) ≤ z 2 z 1 -1 + z 2 -1 + ν -ϕ ( ) (ν 0 ) , where we also use |cos| ≤ 1 and that cos is 1-Lipschitz. Since z i d = z i for i = 1, 2, we obtain using Lemma D.2 and the choice n ≥ KdL P z i -1 > C d n ≤ C e -d , and as long as n ≥ C 2 dL, we obtain on one of the same events P z 2 ≤ 2 ≥ 1 -C e -d . By a union bound, we obtain P z 2 z 1 -1 + z 2 -1 ≤ 3C d n ≥ 1 -2C e -d , so that if we put d = d log n and therefore choose n ≥ C 2 dL log n, we have P z 2 z 1 -1 + z 2 -1 ≤ 3C d log n n ≥ 1 -2C n -d ≥ 1 -2C n -d , with the second bound holding if d ≥ 1 and n ≥ L. For the remaining term, we have by the triangle inequality ν -ϕ ( ) (ν 0 ) ≤ ν -ν + ν -ϕ ( ) (ν 0 ) . By (D.4), the first term on the RHS of the previous expression is equal to zero with probability at least 1 -CLe -cn as long as n ≥ 21. The second term can be controlled with Lemma D.3 provided we select n, L, d to satisfy the hypotheses of that lemma. We thus obtain via an additional union bound P   α (x), α (x ) -cos ϕ ( ) (ν 0 ) > 3C d log n n + C d 3 log 3 n n   ≤ C n -cd +C e -c n . If n ≥ (2/c ) log L and n ≥ (2c/c )d log n, we have C n -cd + C e -c n ≤ (C + C )n -cd . The previous bound then becomes P   α (x), α (x ) -cos ϕ ( ) (ν 0 ) > 3C d log n n + C d 3 log 3 n n   ≤ (C + C )n -cd , and if we worst-case the dependence on and d in the residual in the previous bound, we obtain P   α (x), α (x ) -cos ϕ ( ) (ν 0 ) > (3C + C ) d 3 log 3 n n   ≤ (C + C )n -cd , as claimed. Lemma D.2. There are absolute constants c, C, C > 0 and an absolute constant K > 0 such that for i = 1, 2, every = 1, . . . , L, and any d > 0, if n ≥ max{Kd , 4}, then one has P z i -1 > C d n ≤ C e -cd . Proof. Because z i d = z 1 , it suffices to show P -1 + =1 g 1 + 2 > C d n ≤ C e -cd . 
(D.5) The proof will proceed by showing concentration of the squared quantity =1 g 1 + 2 2 around 1, so that we can appeal to results like Lemma D.26, and then conclude by applying an inequality for the square root to pass to the actual quantity of interest. To enter the setting of Lemma D.26, it makes sense to normalize the factors in the product by their degree, but we must avoid dividing by zero. We have =1 g 1 + 0 = 0 if and only if =1 g 1 + 2 = 0, and whenever =1 g 1 + 2 = 0, we can write =1 g 1 + 2 = =1 g 1 + 2 2 1/2 (D.6) = =1 2 n g 1 + 0 1/2   =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2   1/2 , (D.7) using 0-homogeneity of the 0 "norm". This leads to an extra product-of-degrees term; we will make use of Lemma D.27 to show that the product of degrees itself concentrates. We will also show that the event where a degree is zero is extremely unlikely and proceed with the degree-normalized main term by conditioning. By symmetry, the random variables g 1 + 0 are i.i.d. sums of n Bernoulli random variables with rate 1 2 . By Lemma G.1, we then have P g 1 + 0 < n/2 -t ≤ e -2t 2 /n , and so P min =1,..., g 1 + 0 < n/2 -t = P ∃ ∈ {1, . . . , } : g 1 + 0 < n/2 -t ≤ P g 1 + 0 < n/2 -t ≤ e -2t 2 /n , where the first inequality applies a union bound. Putting t = n/4, we conclude P min =1,..., g 1 + 0 < n/4 ≤ e -n/8 , so that whenever n ≥ 16 log , we have g 1 + 0 ≥ n/4 for every ≤ with probability at least 1 -e -n/16 . This gives us enough to begin working on showing concentration of the squared version of (D.5): partitioning, we can use the previous simplified bound to write P -1 + =1 g 1 + 2 2 > C d n (D.8) ≤ e -n/16 + P min =1,..., g 1 + 0 ≥ n/4, -1 + =1 g 1 + 2 2 > C d n . (D.9) Using (D.7) and the triangle inequality, we can write whenever no terms in the product vanish -1 + =1 g 1 + 2 2 = =1 2 n g 1 + 0   =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2   -1 ≤ =1 2 n g 1 + 0   =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2   -1 + =1 2 n g 1 + 0 -1 . 
(D.10) Moreover, we have by Lemma D.27 P -1 + =1 2 n g 1 + 0 > 4 d n ≤ 4 e -cd as long as n ≥ 128d . Choosing in addition n ≥ 4d and using nonnegativity, this implies P =1 2 n g 1 + 0 > 2 ≤ 4 e -cd , occurring on the same event. Combining the previous two bounds with (D.10) and (D.9) via another partition, we get P -1 + =1 g 1 + 2 2 > C d n ≤ e -n/16 + 4 e -cd + P      min =1,..., g 1 + 0 ≥ n/4, -1 + =1 1 √ n 2 [g 1 ] + 0 n 2 g 1 + 2 2 > (C/2 + 2) d n      , (D.11) where we use here that on the event {min =1,..., g 1 + 0 ≥ n/4}, the quantity =1 g 1 + 2 is nonzero almost surely, which allowed us to invoke the identities (D.7). For (k 1 , . . . , k ) ∈ [n] , we define events E k1 1 , . . . , E k by E k = n 2 g 1 + 0 = k . Conditioning and then relaxing the bounds, we can write P   min =1,..., g 1 + 0 ≥ n/4, -1 + =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2 > C d n   ≤ (k1,...,k )∈[n] k ≥ n/4 P -1 + =1 1 √ n 2 [g 1 ] + 0 n 2 g 1 + 2 2 > C d n E k1 1 , . . . , E k * P E k1 1 , . . . , E k . Conditioned on E k1 1 , . . . , E k with k > 0, the random variable =1 n 2 g 1 + 2 2 / n 2 g 1 + 0 is distributed as a product of independent degree-normalized standard χ 2 random variables with minimum degree min{k 1 , . . . , k }. An application of Lemma D.26 then yields immediately P   -1 + =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2 > C d n E k1 1 , . . . , E k   ≤ C e -cd as long as n ≥ K d , whence P   min =1,..., g 1 + 0 ≥ n/4, -1 + =1 1 n 2 g 1 + 0 n 2 g 1 + 2 2 > C d n   ≤ C e -cd . Combining this previous bound with (D.11) yields P -1 + =1 g 1 + 2 2 > C d n ≤ e -n/16 + C e -cd , where we worst-cased constants in the probability bound. If we choose n ≥ 4C 2 d , we have C d /n ≤ 1/2, and we obtain on the event in the previous bound P -1 + =1 g 1 + 2 2 > 1 2 ≤ e -n/16 + C e -cd . In particular, on the complement of the event in the previous bound, the product lies in [1/2, 3/2]. 
To conclude, we can linearize the square root near 1 to obtain an analogous bound for the product of the norms. Taylor expansion of the smooth function x → x 1/2 about the point 1 gives √ x -1 = 1 2 (x -1) - 1 8 k -3/2 (x -1) 2 , where k lies between x and 1. In particular, if x ≥ 1 2 , we have 1 2 (x -1) - 1 √ 2 (x -1) 2 ≤ √ x -1 ≤ 1 2 (x -1) , so that ( √ x -1) - 1 2 (x -1) ≤ 1 √ 2 (x -1) 2 . Thus, when x ≥ 1 2 we have by the triangle inequality √ x -1 ≤ 1 √ 2 (x -1) 2 + 1 2 |x -1|. from which we conclude based on a partition and our previous choices of large n P -1 + =1 g 1 + 2 > 2C d n ≤ 2e -n/16 + 2C e -cd , which yields the claimed probability bound when n ≥ 16d. Lemma D.3. There are absolute constants c, C, C 0 > 0 and absolute constants K, K > 0 such that for any L max ∈ N and any d ≥ K, if n ≥ K max{1, d 4 log 4 n, d 3 L max log 3 n}, then one has P   ∃L ∈ [L max ] : νL -ϕ (L) (ν 0 ) > C 0 d 3 log 3 n nL   ≤ Cn -cd . (D.12) Proof. The proof uses a recursive construction involving L ∈ [L max ]. Before beginning the main argument, we will define the key quantities that appear and enforce bounds on the parameters to obtain certain estimates. For each L ∈ [L max ], we define the event E L =    νL -ϕ (L) (ν 0 ) > C 0 d 3 log 3 n nL    , where C 0 > 0 is an absolute constant whose value we will specify below, so that E L ∈ F L , and our task is to produce an appropriate measure bound on L∈[Lmax] E L . For notational convenience, we also define E 0 = ∅. For each L ∈ [L max ] and each ∈ [L], we define ∆ L = ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ), so that for every L, ∆ 1 L , . . . , ∆ L L is adapted to the sequence F 1 , . . . , F L , and we have the decomposition νL -ϕ (L) (ν 0 ) = L =1 ∆ L . In particular, we have E L =    L =1 ∆ L > C 0 d 3 log 3 n nL    . 
The sequences (∆ L ) ∈L are not quite martingale difference sequences, but we will show they are very nearly so: writing ∆ L = ∆ L -E ∆ L F -1 ∆ L +E ∆ L F -1 , we have that ( ∆ L ) ∈L is a martingale difference sequence, which can be controlled using truncation and martingale techniques, and the extra conditional expectation term can be controlled analytically. In particular, we have the following estimates: by Lemma D.24, we have if n ≥ max{K 1 log 4 n, K 2 L max } that for every L ∈ [L max ] and every ∈ [L] E ∆ L F -1 ≤ C 1 log n n ν -1 1 + (c 0 /64)(L -)ν -1 (1 + log L) + C 2 1 n 2 ; (D.13) by the first result in Lemma D.25 we have for every d ≥ max{K 3 , 6/c 1 } that if n ≥ K 4 d 4 log 4 n, then for every L ∈ [L max ] and every ∈ [L] (and after worsening constants) P ∆ L > 2C 3 d log n n ν -1 1 + (c 0 /64)(L -)ν -1 + 2C 2 n 2 F -1 ≤ C 5 n -c1d ; (D.14) and by the second result in Lemma D.25, we have by our previous choices of n, d, and L max that for every L ∈ [L max ] and every ∈ [L] (after worsening constants) E ∆ L 2 F -1 ≤ 4C 2 3 d log n n ν -1 1 + (c 0 /64)(L -)ν -1 2 + C 4 n 4 . (D.15) The main line of the argument will consist of showing that a measure bound of the form (D.12) on ∈[L-1] E implies one of the same form on ∈[L] E . For any L ∈ [L max ], on the event E c L we have νL ≤ ϕ (L) (ν 0 ) + C 0 d 3 log 3 n nL ≤ 2 c 0 L + C 0 d 3 log 3 n nL ≤ 3 c 0 L , (D.16) where the second inequality follows from Lemma C.12, and the third follows from the choice n ≥ (C 0 c 0 ) 2 d 3 L log 3 n. In particular, if we make the choice n ≥ (C 0 c 0 ) 2 d 3 L max log 3 n, we have (D.16) on E c L for every L ∈ [L max ]. Accordingly, for every L ∈ [L max ] and every ∈ [L] we define truncation events G L by G L = ∆ L ≤ 2C 3 d log n n ν -1 1 + (c 0 /64)(L -)ν -1 + 2C 2 n 2 ∩ E c -1 . 
(D.17) We have G L ∈ F , and a union bound and (D.14) imply P     ∈[L] G L   c F L-1   ≤ C 5 Ln -c1d + P   ∈[L-1] E F L-1   = C 5 Ln -c1d + 1 ∈[L-1] E , where the second line uses the fact that E ∈ F . In particular, taking expectations recovers P     ∈[L] G L   c   ≤ C 5 Ln -c1d + P   ∈[L-1] E   . (D.18) In addition, by (D.16) we have on E c -1 ν -1 1 + (c 0 /64)(L -)ν -1 ≤ 3 c 0 1 ( -1) + (3/64)(L -) = 3 c 0 1 (3/64)L + (61/64) -1 ≤ 64 c 0 (L -1) ≤ 128 c 0 L , where the final inequality requires L ≥ 2. Thus, when L ≥ 2, we have on G L that ∆ L ≤ 256C 3 c 0 L d log n n + 2C 2 n 2 ≤ 512C 3 c 0 2K0 d log n nL 2 , (D.19) where the final inequality holds when d ≥ 1 and n ≥ (C 2 c 0 /128C 3 ) 2/3 L 2/3 . Similarly, when L ≥ 2, on E c -1 we have by (D.15) E ∆ L 2 F -1 ≤ 2 16 C 2 3 c 2 0 d log n nL 2 + C 4 n 4 ≤ 2 17 C 2 3 c 2 0 d log n nL 2 = 2K 2 0 d log n nL 2 , (D.20) where the second inequality holds when d ≥ 1 and n ≥ (C 4 c 2 0 /2 17 C 2 3 ) 1/3 L 2/3 ; and in the same setting we have by (D.13) E ∆ L F -1 ≤ 128C 1 c 0 (1 + log L) log n nL + C 2 n 2 ≤ 256C 1 c 0 (1 + log L) log n nL , (D.21) where the second inequality holds when n ≥ (C 2 c 0 /128C 1 )L. In particular, if we enforce these conditions with L max in place of L, we have that (D.19) to (D.21) hold for all 2 ≤ L ≤ L max (with (D.20) and (D.21) holding on E c -1 ). We begin the recursive construction. We will enforce C 0 = max{4πC 3 , 6K 0 } for the absolute constant in the definition of E . The main tool is the elementary identity P   ∈[L] E   = P   ∈[L-1] E   + P   E L ∩ ∈[L-1] E c   , (D.22) which allows us to leverage an inductive argument provided we can control P[E L ∩ ∈[L-1] E c ], the probability that the L-th angle deviates above its nominal value subject to all prior angles being controlled. 
The case L = 1 can be addressed directly: (D.14) gives P ∆ 1 1 > 2πC 3 d log n n + 2C 2 n 2 ≤ C 5 n -c1d , and as long as d ≥ 1 and n ≥ (C 2 /πC 3 ) 2/3 , this implies P ∆ 1 1 > 4πC 3 d log n n ≤ C 5 n -c1d . (D.23) This gives a suitable measure bound on E 1 , after choosing d ≥ 1 and n ≥ e so that d 3 log 3 n ≥ d log n. We now assume L ≥ 2. By the triangle inequality, we have L =1 ∆ L ≤ L =1 ∆ L + L =1 E ∆ L F -1 , (D.24) and we therefore have for any t > 0 P   L =1 ∆ L > t ∩ ∈[L-1] E c   ≤ P   L =1 ∆ L + L =1 E ∆ L F -1 > t ∩ ∈[L-1] E c   = P 1 ∈[L-1] E c L =1 ∆ L + 1 ∈[L-1] E c L =1 E ∆ L F -1 > t . (D.25) By (D.21), we have 1 ∈[L-1] E c L =1 E ∆ L F -1 ≤ 256C 1 c 0 (1 + log L) log n n . (D.26) For the remaining term, we have by the triangle inequality L =1 ∆ L ≤ L =1 ∆ L -∆ L 1 G L + L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 + L =1 E ∆ L 1 G L F -1 -E ∆ L F -1 . (D.27) By (D.17), an integration of (D.14), and a union bound, we have P 1 ∈[L-1] E c L =1 ∆ L -∆ L 1 G L > 0 ≤ P   ∈[L] ∆ L > 2C 3 d log n n ν -1 1 + (c 0 /64)(L -)ν -1 + 2C 2 n 2   ≤ C 5 Ln -c1d , (D.28) and we have L =1 E ∆ L 1 G L F -1 -E ∆ L F -1 ≤ L =1 E ∆ L 1 (G L ) c F -1 ≤ π L =1 P G L c F -1 ≤ π L =1 P ∆ L > 2C 3 d log n n ν -1 1 + (c 0 /64)(L -)ν -1 + 2C 2 n 2 ∪ E -1 F -1 ≤ πC 5 Ln -c1d + π L-1 =1 1 E , where the first line uses linearity of the conditional expectation and the triangle inequality for sums and for the integral; the second line uses the worst-case bound of π on the magnitude of the increments ∆ L ; the third line uses (D.17); and the fourth line uses a union bound, E -1 ∈ F -1 , and (D.14). Multiplying both sides of the final bound by 1 ∈[L-1] E c , we conclude 1 ∈[L-1] E c L =1 E ∆ L 1 G L F -1 -E ∆ L F -1 ≤ 1 ∈[L-1] E c πC 5 Ln -c1d ≤ πC 5 Ln -c1d . 
(D.29) For the remaining term in (D.27), we first observe E ∆ L 1 G L -E ∆ L 1 G L F -1 2 F -1 ≤ E ∆ L 2 1 G L F -1 ≤ E ∆ L 2 F -1 , where the first line uses the centering property of the L 2 norm, and the second line uses ∆ L 2 ≥ 0 to drop the indicator for G L . For notational simplicity, we define V L = L =1 E ∆ L 1 G L -E ∆ L 1 G L F -1 2 F -1 , so that our previous bound and (D.20) imply ∈[L-1] E c ⊂ V L ≤ 2K 2 0 d log n nL . This implies that for any t > 0 P 1 ∈[L-1] E c L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 > t = P      ∈[L-1] E c    ∩ L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 > t   ≤ P V L ≤ 2K 2 0 d log n nL ∩ L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 > t . The previous term can be controlled using Lemma G.5 and (D.19): P V L ≤ 2K 2 0 d log n nL ∩ L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 > t ≤ 2 exp   - t 2 /2 2K 2 0 d log n nL + (2K 0 /3)t d log n nL 2   . Setting t = 3K 0 d 3 log 3 n/nL, we obtain P   1 ∈[L-1] E c L =1 ∆ L 1 G L -E ∆ L 1 G L F -1 > 3K 0 d 3 log 3 n nL   ≤ 2 exp - 9 4 d 2 log 2 n 1 + d log n √ L ≤ 2n -(9/8)d , (D.30) where the last line uses the bounds L ≥ 1 and d log n/(1 + d log n) ≥ 1 2 if d ≥ 1 and n ≥ e. Combining (D.28) to (D.30) in (D.27) via a union bound, we obtain P   1 ∈[L-1] E c L =1 ∆ L > 3K 0 d 3 log 3 n nL + πC 5 Ln -c1d   ≤ C 5 Ln -c1d + 2n -(9/8)d . Applying this result and (D.26) to (D.25) via a union bound, we obtain P      L =1 ∆ L > 3K 0 d 3 log 3 n nL + πC 5 Ln -c1d + 256C 1 c 0 (1 + log L) log n n    ∩ ∈[L-1] E   ≤ C 5 Ln -c1d + 2n -(9/8)d . If d ≥ 2/c 1 and n ≥ L max , we have C 5 Ln -c1d ≤ C 5 n -c1d/2 ; under these condition on d and n, we have πC 5 Ln -c1d ≤ πc 5 n -1 , and so πC 5 n -c1d/2 + (256C 1 /c 0 )(1 + log L)(log n)/n ≤ C(1 + log L)(log n)/n; and if d ≥ 1 and n ≥ L max , we have 3K 0 d 3 log 3 n nL ≥ C (1 + log L) log n n provided n ≥ C (C/3K 0 ) 2 L max log L max . 
Under these conditions, our previous bound simplifies to

P[ { |Σ_{ℓ=1}^{L} ∆_ℓ^L| > 6K₀ √(d³ log³ n / (nL)) } ∩ ∩_{ℓ∈[L−1]} E_ℓ^c ] ≤ (2 + C₅) n^{−min{c₁/2, 9/8} d}.

In particular, applying this bound to (D.22), we have shown that for any L ≥ 2

P[ ∪_{ℓ∈[L]} E_ℓ ] ≤ P[ ∪_{ℓ∈[L−1]} E_ℓ ] + (2 + C₅) n^{−min{c₁/2, 9/8} d}.

Unraveling the recursion with (D.23) (and worst-casing the constants there), we conclude

P[ ∪_{ℓ∈[L]} E_ℓ ] ≤ (2 + C₅) L n^{−min{c₁/2, 9/8} d},

which proves the claim, after possibly choosing n to be larger than another absolute constant multiple of L_max to remove the leading L factor.
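The qualitative content of Lemma D.3 — the empirical angle ν^L tracking the deterministic iterates ϕ^(L)(ν⁰), which decay toward zero — can be checked numerically. The following is a minimal simulation sketch, not part of the paper's argument: it iterates the approximate angle evolution map ϕ(ν) = cos⁻¹((1 − ν/π)cos ν + (sin ν)/π) and compares against the angles between forward ReLU features of two inputs in a random network with i.i.d. N(0, 2/n) weights. The width, depth, seed, and tolerance are arbitrary illustration choices.

```python
import numpy as np

def phi(nu):
    """Approximate angle evolution map: cos(phi(nu)) = (1 - nu/pi)*cos(nu) + sin(nu)/pi."""
    return np.arccos(np.clip((1 - nu / np.pi) * np.cos(nu) + np.sin(nu) / np.pi, -1.0, 1.0))

def simulate_angles(n=2000, L=8, nu0=np.pi / 2, seed=0):
    """Track the angle between forward features of two unit inputs in a random
    ReLU net with i.i.d. N(0, 2/n) weights, alongside the iterates phi^(l)(nu0)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n); x[0] = 1.0
    xp = np.zeros(n); xp[0], xp[1] = np.cos(nu0), np.sin(nu0)
    a, ap = x, xp
    angles, preds, nu = [], [], nu0
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
        a, ap = np.maximum(W @ a, 0.0), np.maximum(W @ ap, 0.0)
        c = a @ ap / (np.linalg.norm(a) * np.linalg.norm(ap))
        angles.append(float(np.arccos(np.clip(c, -1.0, 1.0))))
        nu = phi(nu)
        preds.append(float(nu))
    return np.array(angles), np.array(preds)
```

At this width the per-layer deviation is already well below the (loose) tolerance used below, consistent with the √(d³ log³ n / (nL)) rate.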

D.2.2 BACKWARD FEATURE CONTROL

Having established concentration of the feature norms and the angles between them, it remains to control the inner products of backward features that appear in Θ. The core of the technical approach will once again be martingale concentration. We establish the following control on the backward feature inner products: Lemma D.4. Fix x, x ∈ S n0-1 and denote ν = ∠(x, x ). If n ≥ max {KL log n, K Ld b , K }, d b ≥ K log L for suitably chosen K, K , K , K then P L-1 =0 β (x) 2 2 ≤ Cn ≥ 1 -e -c n L . If additionally n, L, d satisfy the requirements of lemmas D.3 and E.16, we have P L-1 =0 β (x), β (x ) - n 2 L-1 i= 1 - ϕ (i) (ν) π ≤ log 2 (n) √ d 4 Ln ≥ 1 -e -cd where ϕ (i) denotes i applications of the angle evolution function defined in lemma E.2, and c > 0, C are absolute constants. Proof. For ∈ [L], write F for the σ-algebra generated by all the weights up to layer in the network, i.e., W 1 , . . . , W , with F 0 given by the trivial σ-algebra. Consider some β (x), β (x ) for 0 ≤ ≤ L -1. Defining Γ : (x) = P I (x) W P I -1 (x) . . . P I (x) W , B : xx = Γ : +2 (x)P I +1 (x) P I +1 (x ) Γ : +2 * (x ), for ∈ { + 1, . . . , L}, and setting Γ +1: +2 (x) = I,B : xx = 1 2 I, we define the event ẼL+1: B = B L: xx 2 F ≤ C 2 nL ∩ B L: xx ≤ CL ∩ tr B L: xx ≤ Cn . Since β (x), β (x ) = W L+1 B L: xx W L+1 * is a Gaussian chaos in terms of the W L+1 vari- ables (and recalling W L+1 ∼ iid N (0, I)) and ẼL B is F L -measurable, the Hanson-Wright inequality (lemma G.4) gives P 1 ẼL+1: B β (x), β (x ) -tr B L: xx ≥ C √ tnL ≤ 2 exp -c min t, tn L ≤ 2e -ct . Using lemma D.28 to bound P ẼL+1: B c from above gives P β (x), β (x ) -tr B L: xx > C √ tnL ≤P 1 ẼL+1: B β (x), β (x ) -tr B L: xx ≥ C √ tnL + P 1 ẼL+1: B c β (x), β (x ) -tr B L: xx ≥ C √ tnL ≤P 1 ẼL+1: B β (x), β (x ) -tr B L: xx ≥ C √ tnL + P ẼL+1: B c ≤2e -ct + Cn -c n L ≤ C e -c t (D.31) for appropriately chosen constant, if t n log n/L. 
Choosing t = n/L in the bound above and using the bound on tr B L: xx from lemma D.28 we obtain P β (x) 2 2 ≥ 2Cn ≤ P β (x) 2 2 -tr B L xx > Cn + P tr B L xx > Cn ≤ Ce -c n L + C nL 2 e -c n L ≤ C nL 2 e -c n L for appropriate constants. Taking a union bound over the possible values of proves the first part of the lemma. We next control tr B L: xx -n 2 L-1 = 1 -ϕ ( ) (ν) π using martingale concentration (in a similar manner to the control of the angles established in previous sections). We write tr B L: xx - n 2 L-1 i= 1 - ϕ (i) (ν) π (D.32) = L-1 = L-1 i= +1 1 - ϕ (i) (ν) π tr B +1: xx -1 - ϕ ( ) (ν) π tr B : xx ≡ L = +1 ∆ (D.33) (note the change in the indexing). Consider the filtration F 0 ⊂ • • • ⊂ F L and adapted sequence ∆ = ∆ -E ∆ |F -1 , (D.34) so that L = +1 ∆ = L = +1 ∆ + L = +1 E ∆ |F -1 . (D.35) We begin by considering the first term in the sum since it takes a distinct form. Denoting by W +1 (:,i) the i-th column of W +1 , rotational invariance of the Gaussian distribution gives tr B +1: xx =tr P I +1 (x) P I +1 (x ) =tr P W +1 α (x)>0 P W +1 α (x )>0 d =tr P W +1 (:,1) >0 P W +1 (:,1) cos ν +W +1 (:,2) sin ν >0 and hence E tr B +1: xx F = E W +1 tr B +1: xx = n E g1,g2 1 g1>0 1 g1 cos ν +g1 sin ν >0 where (g 1 , g 2 ) ∼ N (0, I). Moving to spherical coordinates, we obtain E W +1 tr B +1: xx = n 2π ∞ 0 π/2 -π 2 +ν e -r 2 /2 rdrdθ = n 2 1 - ν π . We now note that conditioned on F , tr P W 0 (:,1) >0 P W 0 (:,1) cos ν+W 0 (:,2) sin ν>0 is a sum of n independent variables taking values in {0, 1}. An application of Bernstein's inequality for bounded random variables (lemma G.3) then gives P ∆ +1 > √ nd =P L-1 i= +1 1 - ϕ (i) (ν) π tr B +1: xx - n 2 1 - ν π > √ nd ≤2 exp -c nd n + √ nd ≤ 2e -c d (D.36) for some c , where we used the fact that the angle evolution function ϕ is bounded by π/2. 
Note also from Lemma D.3 that P   E ∆ +1 |F - n 2 L-1 i= 1 - ϕ (i) (ν) π > d 3 log 3 (n)n L   =P   n 2 L-1 i= +1 1 - ϕ (i) (ν) π ν π - ϕ ( ) (ν) π > d 3 log 3 (n)n L   ≤e -cd (D.37) for some constant c, where we assumed d > K log n for some K. Having controlled the first term in (D.35), we now proceed to bound the remaining terms. We define events E : B = α -1 (x) 2 α -1 (x ) 2 > 0 ∩ ϕ ( -1) (ν) -ν -1 ≤ C d 3 log 3 n n ∩ tr B -1: xx ≤ Cn ∩ B -1: xx 2 F ≤ C 2 n ∩ B -1: xx ≤ C , (which from lemma D.28 hold with high probability). Note that as a consequence of the first event in E : B the angle ν is well-defined. Note that E : B is F -1 -measurable. We will first control (D.35) by considering each summand truncated on the respective event E : B . Our task is therefore to control L = +2 1 E : B ∆ + L = +2 1 E : B E∆ |F -1 . Since E 1 E : B ∆ = E E 1 E : B ∆ F -1 = E 1 E : B E ∆ F -1 = 0, the first sum is over a zero-mean adapted sequence and hence a martingale, and can thus be controlled using the Azuma-Hoeffding inequality. We will first show that the remaining term is small. We begin by computing 1 E : B Etr B : xx |F -1 = E W tr 1 E : B B -1: xx W * P W α -1 (x )>0 P W α -1 (x)>0 W where we used the fact that E : B ∈ F -1 and is thus independent of W . There exists a matrix R such that Rα -1 (x) = α -1 (x) 2 e 1 , Rα -1 (x ) = α -1 (x ) 2 e 1 cos ν -1 + e 2 sin ν -1 .

Rotational invariance of the Gaussian distribution gives

W α -1 (x) d = W (:,1) α (x) 2 , W α -1 (x ) d = α -1 (x ) 2 W (:,1) cos ν -1 + W (:,2) sin ν -1 , where we denote by W (:,i) the i-th column of W . Defining B -1: xx = RB -1: xx R * we have Etr B : xx |F -1 = E W tr 1 E : B B -1: xx W * P W (:,1) >0 P W (:,1) cos ν -1 +W (:,2) sin ν -1 >0 W =1 E : B B -1: : xx ji E W ijk W ki 1 W k1 >0 1 W k1 cos ν -1 +W k2 sin ν -1 >0 W kj =1 E : B n i,j=1 B -1: : xx ji E W nW 1i 1 W 11 >0 1 W 11 cos ν -1 +W 12 sin ν -1 >0 W 1j . = n i,j=1 1 E : B B -1: : xx ji Q -1 ij . If i / ∈ {1, 2} we get (with the square brackets denoting indicators) Q -1 ij = E W 2δ ij W 11 > 0 W 11 cos ν -1 + W 12 sin ν -1 > 0 = δ ij 1 - ν -1 π . If i ∈ {1, 2} then the Q -1 ij = 0 only if j ∈ {1, 2}. In these cases we have Q -1 11 = E W n W 11 2 W 11 > 0 W 11 cos ν -1 + W 12 sin ν -1 > 0 =2 E W g 2 1 [g 1 > 0] g 1 cos ν -1 + g 2 sin ν -1 > 0 where (g 1 , g 2 ) ∼ N (0, I). Moving to spherical coordinates, we obtain Q -1 11 = 1 π ∞ 0 π/2 -π 2 +ν -1 e -r 2 /2 r 3 cos 2 θdrdθ = π -ν -1 + sin ν -1 cos ν -1 π , and similarly Q -1 22 = 1 π ∞ 0 π/2 -π 2 +ν e -r 2 /2 r 3 sin 2 θdrdθ = π -ν -1 -sin ν -1 cos ν -1 π Q -1 12 = Q -1 21 = E W nW 11 W 11 > 0 W 11 cos ν -1 + W 12 sin ν -1 > 0 W 12 = 1 π ∞ 0 π/2 -π 2 +ν -1 e -r 2 /2 r 3 sin θ cos θdrdθ = 1 2π π/2 -π 2 +ν -1 sin θ cos θdθ = sin 2 ν -1 2π . Combining terms and using tr B -1: xx = tr B -1: xx we obtain 1 E : B E tr B : xx |F -1 = 1 E : B         π -ν -1 π tr B -1: xx + sin ν -1 cos ν -1 π B -1: xx 11 -B -1: xx 22 + sin 2 ν -1 2π B -1: xx 12 + B -1: xx 21         , hence 1 E : B E W tr B : xx -1 - ϕ -1 (ν) π tr B -1: xx =1 E : B      ϕ -1 (ν)-ν -1 π tr B -1: xx + sin ν -1 cos ν -1 π B -1: xx 11 -B -1: xx 22 + sin 2 ν -1 2π B -1: xx 12 + B -1: xx 21      On E : B , the bound on ϕ -1 (ν) -ν -1 and lemma C.12 give ν -1 ≤ C a.s.. Additionally, on this event max i,j∈[n] B -1: xx ij ≤ B -1: xx ≤ C a.s.. 
It follows that 1 E : B E W tr B : xx -1 - ϕ -1 (ν) π tr B -1: xx (D.38) ≤ C 2 d 3 n log 3 n + 2C 2 π + C 3 π ≤ C d 3 n log 3 n (D.39) almost surely, and hence restoring a constant factor with magnitude bounded by 1, we have 1 E : B E∆ |F -1 (D.40) = L-1 i= 1 - ϕ (i) (ν) π 1 E : B E W tr B : xx -1 - ϕ -1 (ν) π tr B -1: xx (D.41) ≤ a.s. C d 3 n log 3 n . (D.42) Using lemma D.28 to bound P E : B c from above then gives P   E∆ |F -1 > C d 3 n log 3 n   < P E : B c ≤ C n -cd . (D.43) An application of the triangle inequality and union bound then give P L = +2 E∆ |F -1 > C d 3 Ln log 3 n ≤P L = +2 E∆ |F -1 > C d 3 Ln log 3 n ≤ L = +2 P   E∆ |F -1 > C d 3 n log 3 n L   ≤ L = +2 P E : B c ≤CLn -cd (D.44) for some constants c, C. We proceed to control the remaining terms in (D.35), namely L = +2 ∆ . Aiming to apply martingale concentration, we require an almost sure bound on the summands, which we achieve by truncation. Towards this end, we define an event G =    |∆ | ≤ C √ d + C d 3 n log 3 n    . Combining (D.43) and the result of lemma D.29 (after taking an expectation) we have P [G ] ≥ 1 -P   E∆ |F -1 > C d 3 n log 3 n   -P ∆ > C √ d ≥ 1 -C n -cd -C e -c d ≥ 1 -C e -c d (D.45) for appropriate constants. We now decompose the sum that we would like to bound: L = +2 ∆ ≤ L = +2 ∆ -∆ 1 G Σ1 + L = +2 ∆ 1 G -E ∆ 1 G |F -1 Σ2 + L = +2 E ∆ 1 G |F -1 -E ∆ |F -1 Σ3 . (D.46) Since each summand in Σ 1 are equal to zero on the respective truncation event, a union bound and (D.45) give P L = +2 ∆ -∆ 1 G > 0 ≤ P L = +2 G c ≤ L = +2 P [G c ] ≤ LCe -cd (D.47) for some constants. The term Σ 2 is a sum of almost surely bounded martingale differences. We can apply the Azuma-Hoeffding inequality (lemma G.8) directly to conclude P L = +2 ∆ 1 G -E∆ 1 G |F -1 > d 2 nL log 3 n log L (D.48) ≤ exp      - d 4 nL log 3 n log L 2 L = +2 C √ d + C d 3 n log 3 n 2      ≤ e -cd . 
(D.49) Considering a single summand in Σ 3 , Jensen's inequality and the Cauchy-Schwarz inequality give E ∆ 1 G -∆ | F -1 = E ∆ 1 G c F -1 ≤ a.s. E |∆ | 1 G c F -1 ≤ a.s. E 1 G c F -1 1/2 E ∆ 2 F -1 1/2 . (D.50) This is an F -1 -measurable function, and we can show that it is small on the event E : B ∈ F -1 . To control the first factor, we note that 1 E : B E 1 G c F -1 =E 1 G c ∩E : B F -1 =P      |∆ | > C √ d + C d 3 n log 3 n    ∩ E : B F -1   ≤ P 1 E : B ∆ > C √ d + C d 3 n log 3 n ∩ E : B F -1 +P 1 (E : B ) c ∆ > 0 ∩ E : B F -1 ≤P   1 E : B ∆ > C √ d + C d 3 n log 3 n F -1   ≤ P 1 E : B ∆ > C √ d F -1 +P 1 E : B E∆ |F -1 > C d 3 n log 3 n F -1 ≤ a.s. P 1 E : B ∆ > C √ d ≤ a.s. C e -cd (D.51) where to obtain the second to last line we used the definition of ∆ , then used Lemma D.29 and (D.40) to bound the first and second term almost surely. We proceed to control the second factor in (D.50), by bounding 1 E : B E ∆ 2 F -1 =1 E : B L-1 i= 1 - ϕ (i) (ν) π 2 E W tr B : xx -1 - ϕ ( -1) (ν) π tr B -1: xx 2 ≤ a.s. 1 E : B E W tr B : xx -1 - ϕ ( -1) (ν) π tr B -1: xx 2 ≤ a.s. 4 E W     1 E : B tr B : xx -E W tr B : xx 2 + 1 E : B E W tr B : xx -1 -ϕ ( -1) (ν) π tr B -1: xx 2     . Using (D.38) and (D.148) to bound the integrand above, we have P 1 E : B E ∆ 2 F -1 > C d + d 3 n log 3 n ≤ C e -cd for appropriate constants. Combining (D.51) and the above bound gives P   1 E : B E 1 G ∆ -∆ | F -1 > C d + d 3 n log 3 n e -cd   ≤ C e -c d for some c, c , C, C , and using lemma D.28 P   E 1 G ∆ -∆ | F -1 > C d + d 3 n log 3 n e -cd   ≤ C e -c d + P (E : B ) c ≤ C e -c d + C n -c d ≤ C e -c d for appropriate constants. An application of the triangle inequality and a union bound (and introducing some slack to simplify the expression) then gives P L = +2 E 1 G ∆ -∆ | F -1 > CL d 3 n log 3 ne -cd ≤ C Le -cd for some constants. 
Combining this bound with (D.47) and (D.48) gives P L = +2 ∆ > d 2 nL log 3 n log L + CL d 3 n log 3 ne -cd ≤ C Le -c d + e -cd + C Le -c d ≤ C Le -c d P L = +2 ∆ > Cd 2 Ln log 3 n log L ≤ C Le -cd . where in the last inequality we assumed K log L ≤ d. Combining this with (D.44), we obtain P L = +2 ∆ > Cd 2 Ln log 3 n log L ≤ C Ln -c d + C Le -c d ≤ C Le -cd for appropriate constants. This bound all the terms in the sum (D.32) aside from the first one. The first term is bounded in (D.36), (D.37), and the fluctuations due to the last layer are bounded in (D.31) . Combining all of these gives P β (x), β (x ) - n 2 L-1 i= 1 - ϕ (i) (ν) π > Cd 2 Ln log 3 n log L ≤P β (x), β (x ) -tr B L-1: xx > C 4 d 2 Ln log 3 n log L + P |∆ +1 | ≤ C 3 d 2 Ln log 3 n log L + P L = +2 ∆ ≤ C 4 d 2 Ln log 3 n log L + P L = +2 E∆ |F -1 ≤ C 4 d 2 Ln log 3 n log L ≤ C Le -cd after worsening constants. A final union bound over and assuming d ≥ K log L gives P L-1 =0 β (x), β (x ) - n 2 L-1 i= 1 - ϕ (i) (ν) π ≥ Cd 2 Ln log 3 n log L ≥ 1-C e -cd for appropriately chosen c, C , K. If we additionally assume n > L we obtain the desired result.
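The closed-form first moments used in the proof above (the orthant probability behind E tr B = (n/2)(1 − ν/π) and the Q entries, stated here for standard Gaussians, i.e. up to the 2/n variance normalization) can be spot-checked by Monte Carlo. A minimal sketch, not from the paper; the angle, sample size, seed, and tolerances are arbitrary.

```python
import numpy as np

def orthant_moments(nu, n_samples=2_000_000, seed=0):
    """Monte Carlo estimates of truncated moments of (g1, g2) ~ N(0, I) on the
    event S = {g1 > 0, g1*cos(nu) + g2*sin(nu) > 0}:
      P(S)            -> (pi - nu) / (2*pi)
      E[g1^2 1_S]     -> (pi - nu + sin(nu)*cos(nu)) / (2*pi)
      E[g2^2 1_S]     -> (pi - nu - sin(nu)*cos(nu)) / (2*pi)
      E[g1*g2 1_S]    -> sin(nu)^2 / (2*pi)
    """
    rng = np.random.default_rng(seed)
    g1, g2 = rng.standard_normal((2, n_samples))
    S = (g1 > 0) & (g1 * np.cos(nu) + g2 * np.sin(nu) > 0)
    return (float(S.mean()),
            float((g1 ** 2 * S).mean()),
            float((g2 ** 2 * S).mean()),
            float((g1 * g2 * S).mean()))
```

As a sanity check at ν = π/2 the event factors into independent half-lines, and the four limits reduce to 1/4, 1/2·1/2 = ... etc., matching direct computation.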

D.3 UNIFORMIZATION ESTIMATES

D.3.1 NETS AND COVERING NUMBERS

We appeal to Lemma C.4 to obtain estimates for the covering number of M, which we will use throughout this section. In the remainder of this section, we will use the notation N_ε to denote the ε-nets for M constructed in Lemma C.4, and for any x̃ ∈ N_ε, we will also use the notation N_ε(x̃) = B(x̃, ε) ∩ M_σ, where σ ∈ {+, −} is the component containing x̃, to denote the relevant connected neighborhood of the specific point in the net we are considering. Here we are implicitly assuming that M₊ and M₋ are themselves connected, but this construction evidently generalizes to cases where M₊ and M₋ have more than one connected component, as treated in Lemma C.4. Focusing on this simpler case in the sequel will allow us to keep our notation concise.
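The greedy construction underlying covering-number estimates of this kind can be illustrated concretely. Below is a minimal sketch — not the construction of Lemma C.4 itself — building an ε-net for a one-dimensional model manifold (a great circle inside S^{n0−1}) by greedy selection; every sampled point then lies within ε of the net, and the net size scales like length/ε. The values of `n0`, `m`, and `eps` are arbitrary illustration parameters.

```python
import numpy as np

def greedy_net(points, eps):
    """Greedy eps-net: keep a point iff it is farther than eps from every point
    kept so far; by construction every scanned point is within eps of the net."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return np.array(net)

# a one-dimensional manifold: a great circle inside S^{n0-1} (here n0 = 8)
n0, m, eps = 8, 5000, 0.05
ts = np.linspace(0, 2 * np.pi, m, endpoint=False)
M = np.zeros((m, n0))
M[:, 0], M[:, 1] = np.cos(ts), np.sin(ts)
net = greedy_net(M, eps)
# covering property: distance from each sample to its nearest net point
dists = np.min(np.linalg.norm(M[:, None, :] - net[None, :, :], axis=2), axis=1)
```

Here the circle has length 2π, so the net has on the order of 2π/ε ≈ 126 points, in line with the linear-in-1/ε covering numbers one expects for a curve.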

D.3.2 CONTROLLING SUPPORT CHANGES UNIFORMLY

The quantities we have studied in Appendix D.2 are challenging to uniformize due to discontinuities in the support projections P_{I(·)}. We will get around this difficulty by carefully tracking (with high probability) how much the supports can change when we move away from the points in our net N_ε. It seems intuitively obvious that when ε is exponentially small in all problem parameters, there should be almost no support changes when moving away from our net; the challenge is to show that this property also holds when ε is not so small. Introduce the following notation for the network preactivations at level ℓ, where ℓ ∈ [L]: ρ^ℓ(x) = W^ℓ α^{ℓ−1}(x), so that α^ℓ(x) = [ρ^ℓ(x)]₊. We also let F_ℓ denote the σ-algebra generated by all weight matrices up to level ℓ in the network, and let F₀ denote the trivial σ-algebra.

Definition D.1. Let ε, ∆ > 0, and let x̃ ∈ N_ε. For ℓ ∈ [L], a feature (α^ℓ(x̃))_i is called ∆-risky if |(ρ^ℓ(x̃))_i| ≤ ∆; otherwise, it is called ∆-stable. If for all x ∈ N_ε(x̃) we have

∀ ℓ′ ∈ [ℓ],  ‖ρ^{ℓ′}(x) − ρ^{ℓ′}(x̃)‖_∞ ≤ ∆,

we say that stable sign consistency holds up to layer ℓ. We abbreviate this condition as SSC(ℓ, ε, ∆) at x̃, with the dependence on x̃, ε, and ∆ suppressed when it is clear from context.

If SSC(ℓ) holds at x̃ and if (α^ℓ(x̃))_i is stable, we can write for any x ∈ N_ε(x̃)

sign((ρ^ℓ(x))_i) = sign( (ρ^ℓ(x̃))_i + [(ρ^ℓ(x))_i − (ρ^ℓ(x̃))_i] ) = sign((ρ^ℓ(x̃))_i),

so that no stable feature supports change on N_ε(x̃), and we only need to consider changes due to the risky features. Moreover, observe that

P[ (ρ^ℓ(x̃))_i ∈ {±∆} ] = E[ P[ ‖α^{ℓ−1}(x̃)‖₂ ⟨e_i, g⟩ ∈ {±∆} | F_{ℓ−1} ] ] = 0,   (D.52)

where g ∼ N(0, (2/n)I) is independent of everything else in the problem, since ∆ > 0. It follows that when considering the network features over any countable collection of points x ∈ M, we have almost surely that the risky features are witnessed in the interior of [−∆, +∆].
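The measure of the risky set in Definition D.1 can be estimated heuristically: for a unit-norm feature α^{ℓ−1}, each preactivation coordinate is N(0, 2/n), so the probability of being ∆-risky is roughly 2∆ times the Gaussian density at zero, i.e. about ∆√(n/π), giving order ∆n^{3/2} risky coordinates per layer — the same n^{3/2}L∆ scale that appears in Lemma D.5. A Monte Carlo sketch (not from the paper; the constant in the heuristic is an assumption of this illustration):

```python
import numpy as np

def risky_fraction(n=2000, delta=1e-3, trials=200, seed=0):
    """Fraction of Delta-risky coordinates |(W a)_i| <= delta for W with i.i.d.
    N(0, 2/n) entries and a unit-norm input a; heuristically this is about
    2 * delta * (density at 0) = delta * sqrt(n / pi)."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        # for unit-norm a, each coordinate of W @ a is exactly N(0, 2/n)
        rho = rng.normal(0.0, np.sqrt(2.0 / n), size=n)
        count += int(np.sum(np.abs(rho) <= delta))
    return count / (trials * n)
```

With these illustration parameters the predicted fraction is 10⁻³ · √(2000/π) ≈ 0.025, and the simulation lands close to it.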
Below, we will show that with appropriate choices of ε and ∆, with very high probability: (i) each point in the net x has very few risky features; and (ii) SSC(L) holds uniformly over the net under reasonable conditions involving n, L, d. We write R ( x, ∆) ⊂ [n] for the random variable consisting of the set of indices of ∆-risky features at level with input x ∈ N ε . Lemma D.5. There is an absolute constant K > 0 such that for any x ∈ M and any d > 0, if n ≥ max{KdL, 4} and ∆ ≤ d log n/(6n 3/2 L), then one has P L =1 |R ( x, ∆)| > d log n ≤ 2n -d + L 2 e -cn/L . Proof. For any x ∈ N ε , Lemma D.2 (with a suitable choice of d in that context) gives P α ( x) 2 -1 > 1 2 ≤ C e -c n , so that if additionally n ≥ (2/c) log(C), one has P α ( x) 2 -1 > 1 2 ≤ e -c n . (D.53) Let G = {1/2 ≤ α ( x) 2 ≤ 2}, so that G is F -measurable, and G = ∩ ∈[L-1] G ; then by (D.53 ) and a union bound, we have P[G] ≥ 1 -L 2 e -cn/L . We also let G 0 = ∅ c . For i ∈ [n] and ∈ [L], consider the random variables X i = |(ρ ( x)) i |, and moreover define Xi = X i α -1 ( x) 2 1 G -1 . We have i, 1 X i ≤∆ = |R ( x)|, which is the total number of ∆-risky features at x, and the corresponding sum with the random variables Xi is thus an upper bound on the number of risky features at x. Notice that X i and Xi are F -measurable, and additionally notice that on G, we have X i /2 ≤ Xi ≤ 2X i . For any K ∈ {0, 1, . . . , nL -1, nL}, we have by disjointness of the events in the union and a partition P   i, 1 X i ≤∆ > K   ≤ L 2 e -cn/L + nL k=K+1 P   G ∩    i, 1 X i ≤∆ = k      ≤ L 2 e -cn/L + nL k=K+1 P   G ∩    i, 1 Xi ≤2∆ = k      , so it is essentially equivalent to consider the Xi . By another partitioning we can write P   G ∩    i, 1 Xi ≤2∆ = k      = S∈{0,1} n×L : S 2 F =k E L =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i where {0, 1} n×L is the set of n × L matrices with entries in {0, 1}. 
Using the tower rule and F L-1 -measurability of all factors with < L, we can then write E L =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i = E E L =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i F L-1 = E L-1 =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i 1 G L-1 E n i=1 1 1 XiL ≤2∆ =S iL F L-1 . We study the inner conditional expectation as follows: because ρ L ( x) = W L α L-1 ( x), we can apply rotational invariance in the conditional expectation to obtain 1 G L-1 E n i=1 1 1 XiL ≤2∆ =S iL F L-1 = 1 G L-1 E n i=1 1 1 |(w L ) i |1 G -1 ≤2∆ =S iL F L-1 = 1 G L-1 E n i=1 1 1 |(w L ) i |≤2∆ =S iL F L-1 , where w L ∼ N (0, (2/n)I) is the first column of W L , and the last equality takes advantage of the presence of the indicator for 1 G -1 multiplying the conditional expectation. We then write using independence E n i=1 1 1 |(w L ) i |≤2∆ =S iL F L-1 = P n i=1 1 |(w L )i|≤2∆ = S iL F L-1 = n i=1 P 1 |(w L )i|≤2∆ = S iL F L-1 , and putting p L = P[|(w L ) 1 | ≤ 2∆], we have by identically-distributedness n i=1 P 1 |(w L )i|≤2∆ = S iL F L-1 = n i=1 p S iL L (1 -p L ) 1-S iL . After removing the indicator for G L-1 by nonnegativity of all factors in the expectation, this leaves us with E L =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i ≤ n i=1 p S iL L (1 -p L ) 1-S iL E L-1 =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i . This process can evidently be iterated L-1 additional times with analogous definitions-we observe that the fact that all weight matrices W have the same column distribution implies that p 1 = • • • = p L , so we write p = p 1 henceforth-and by this we obtain E L =1 1 G -1 n i=1 1 1 Xi ≤2∆ =S i ≤ L =1 n i=1 p S iL (1 -p) 1-S iL , and in particular P   G ∩    i, 1 Xi ≤2∆ = k      ≤ S∈{0,1} n×L : S F =k L =1 n i=1 p S iL (1 -p) 1-S iL . For i ∈ [n] and ∈ [L], let Y i denote nL i.i.d. Bern(p) random variables; we recognize this last sum as the probability that i, Y i = k. 
In particular, using our previous work we can assert for any t > 0

P[ Σ_{i,ℓ} 1_{X_{ℓi} ≤ ∆} > t ] ≤ P[ Σ_{i,ℓ} Y_{ℓi} > t ] + L^2 e^{-cn/L},

so to conclude it suffices to articulate some binomial tail probabilities and estimate p. We have

p = P[ |(w_L)_i| ≤ 2∆ ] = √(n/(2π)) ∫_0^{2∆} exp(−nt^2/4) dt ≤ ∆ √(2n/π),   (D.54)

and we can write with the triangle inequality and a union bound

P[ Σ_{i,ℓ} Y_{ℓi} > t ] ≤ P[ Σ_{i,ℓ} (Y_{ℓi} − E[Y_{ℓi}]) > t ] + P[ Σ_{i,ℓ} E[Y_{ℓi}] > t ].

By (D.54), we have Σ_{i,ℓ} E[Y_{ℓi}] ≤ n^{3/2} L ∆. We calculate using independence

E[ ( Σ_{i,ℓ} (Y_{ℓi} − E[Y_{ℓi}]) )^2 ] ≤ Σ_{i,ℓ} E[Y_{ℓi}] ≤ n^{3/2} L ∆,

so an application of Lemma G.3 yields

P[ Σ_{i,ℓ} (Y_{ℓi} − E[Y_{ℓi}]) > t ] ≤ 2 exp( − (t^2/2) / (n^{3/2} L ∆ + t/3) ).

For any d > 0, if we choose t = d log n and enforce ∆ ≤ d log n / (6 n^{3/2} L), we obtain P[ Σ_{i,ℓ} Y_{ℓi} > d log n ] ≤ 2 n^{-d}, from which we conclude, as sought,

P[ Σ_{i,ℓ} 1_{X_{ℓi} ≤ ∆} > d log n ] ≤ 2 n^{-d} + L^2 e^{-cn/L}.

The next task is to study the stable sign condition at a point x̄ as a function of ε and ∆, assuming ∆ at least satisfies the hypotheses of Lemma D.5. In particular, we will be interested in conditions under which we can guarantee that SSC(ℓ − 1) holding implies that SSC(ℓ) holds. Let S_ℓ(x̄, ∆) = [n] \ R_ℓ(x̄, ∆) denote the ∆-stable features at level ℓ with input x̄, and define for 0 ≤ ℓ′ ≤ ℓ ≤ L

T_{ℓ,ℓ′}^x = P_{S_ℓ(x̄)} P_{I_ℓ(x)} W_ℓ P_{S_{ℓ−1}(x̄)} P_{I_{ℓ−1}(x)} W_{ℓ−1} ⋯ P_{S_{ℓ′+1}(x̄)} P_{I_{ℓ′+1}(x)} W_{ℓ′+1};   Φ_{ℓ,ℓ′}^x = W_ℓ T_{ℓ−1,ℓ′}^x,   (D.55)

so that Φ_{ℓ,ℓ′}^x carries an input x ∈ N_ε(x̄) applied at the features at level ℓ′ (in particular, ℓ′ = 0 corresponds to α_0(x) = x, the network input) to the preactivations at level ℓ in a network restricted to only the stable features at x̄. We can write

ρ_ℓ(x) = W_ℓ P_{I_{ℓ−1}(x)} W_{ℓ−1} ⋯ P_{I_1(x)} W_1 x;   α_ℓ(x) = P_{I_ℓ(x)} W_ℓ P_{I_{ℓ−1}(x)} W_{ℓ−1} ⋯ P_{I_1(x)} W_1 x,

which gives us a useful representation if we disregard all levels with no risky features: let r = Σ_{ℓ=1}^{L} 1_{|R_ℓ(x̄,∆)| > 0} be the number of levels in the network with risky features, and let ℓ_1 < ℓ_2 < ⋯ < ℓ_r denote the levels at which risky features occur. If no risky features occur at a level ℓ, we of course have P_{S_ℓ(x̄)} = I. Assume to begin that ℓ > ℓ_r, and start by writing

ρ_ℓ(x) = Φ_{ℓ,ℓ_r}^x ( P_{S_{ℓ_r}(x̄)} + P_{R_{ℓ_r}(x̄)} ) P_{I_{ℓ_r}(x)} Φ_{ℓ_r,ℓ_{r−1}}^x ( P_{S_{ℓ_{r−1}}(x̄)} + P_{R_{ℓ_{r−1}}(x̄)} ) P_{I_{ℓ_{r−1}}(x)} ⋯ Φ_{ℓ_2,ℓ_1}^x ( P_{S_{ℓ_1}(x̄)} + P_{R_{ℓ_1}(x̄)} ) P_{I_{ℓ_1}(x)} Φ_{ℓ_1,0}^x x.

Now we distribute from left to right, and recombine everything to the right on the term corresponding to the projection onto the risky features at ℓ_r; this gives

ρ_ℓ(x) = Φ_{ℓ,ℓ_r}^x P_{R_{ℓ_r}(x̄)} α_{ℓ_r}(x) + Φ_{ℓ,ℓ_{r−1}}^x ( P_{S_{ℓ_{r−1}}(x̄)} + P_{R_{ℓ_{r−1}}(x̄)} ) P_{I_{ℓ_{r−1}}(x)} ⋯ Φ_{ℓ_2,ℓ_1}^x ( P_{S_{ℓ_1}(x̄)} + P_{R_{ℓ_1}(x̄)} ) P_{I_{ℓ_1}(x)} Φ_{ℓ_1,0}^x x.

We can write

Φ_{ℓ,ℓ_r}^x P_{R_{ℓ_r}(x̄)} α_{ℓ_r}(x) = Φ_{ℓ,ℓ_r}^x |_{R_{ℓ_r}(x̄)} · α_{ℓ_r}(x) |_{R_{ℓ_r}(x̄)},

where the restriction notation emphasizes that we are considering a column submatrix of the transfer operator induced by the risky features. Iterating the previous argument, we obtain

ρ_ℓ(x) = Φ_{ℓ,0}^x x + Σ_{i=1}^{r} Φ_{ℓ,ℓ_i}^x |_{R_{ℓ_i}(x̄)} · α_{ℓ_i}(x) |_{R_{ℓ_i}(x̄)}.

It is clear that an analogous argument can be used in the case where ℓ ≤ ℓ_r by adapting which risky features can be visited: we can thus assert

ρ_ℓ(x) = Φ_{ℓ,0}^x x + Σ_{i∈[r] : ℓ_i < ℓ} Φ_{ℓ,ℓ_i}^x |_{R_{ℓ_i}(x̄)} · α_{ℓ_i}(x) |_{R_{ℓ_i}(x̄)}.   (D.56)

Furthermore, we note that under SSC(ℓ − 1), the support pattern of the stable features does not change on N_ε(x̄), and so one has Φ_{ℓ′′,ℓ′}^x = Φ_{ℓ′′,ℓ′}^{x̄} for every x ∈ N_ε(x̄), so under SSC(ℓ − 1) we have by (D.56)

ρ_ℓ(x) − ρ_ℓ(x̄) = Φ_{ℓ,0}^{x̄} (x − x̄) + Σ_{i∈[r] : ℓ_i < ℓ} Φ_{ℓ,ℓ_i}^{x̄} |_{R_{ℓ_i}(x̄)} ( α_{ℓ_i}(x) |_{R_{ℓ_i}(x̄)} − α_{ℓ_i}(x̄) |_{R_{ℓ_i}(x̄)} ).

The ReLU [·]_+ is 1-Lipschitz with respect to ‖·‖_∞, and by monotonicity of the max under restriction and SSC(ℓ − 1) we have

‖ α_{ℓ_i}(x) |_{R_{ℓ_i}(x̄)} − α_{ℓ_i}(x̄) |_{R_{ℓ_i}(x̄)} ‖_∞ ≤ ‖ ρ_{ℓ_i}(x) |_{R_{ℓ_i}(x̄)} − ρ_{ℓ_i}(x̄) |_{R_{ℓ_i}(x̄)} ‖_∞ ≤ ‖ ρ_{ℓ_i}(x) − ρ_{ℓ_i}(x̄) ‖_∞ ≤ ∆.

Thus, by the triangle inequality, we have under SSC(ℓ − 1) a bound

‖ ρ_ℓ(x) − ρ_ℓ(x̄) ‖_∞ ≤ ε ‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} + ∆ Σ_{i∈[r] : ℓ_i < ℓ} ‖ Φ_{ℓ,ℓ_i}^{x̄} |_{R_{ℓ_i}(x̄)} ‖_{ℓ^∞→ℓ^∞}.   (D.57)

This suggests an inductive approach to establishing SSC(ℓ) provided we have established it at previous layers; we just need to control the transfer coefficients in (D.57).

Lemma D.6. There are absolute constants c, c′, C, C′, C′′, C′′′ > 0 and absolute constants K, K′ > 0 such that for any 1 ≤ ℓ′ < ℓ ≤ L, any d ≥ K log n and any x̄ ∈ S^{n_0−1}, if ∆ ≤ c n^{−5/2} and n ≥ K′ max{d^4 L, 1}, then one has

P[ ‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} ≤ C (1 + √(n_0/n)) ] ≥ 1 − C′ e^{−cd},

and for any fixed S ⊂ [n], one has

P[ ‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} ≤ C′′ |S| √(d/n) ] ≥ 1 − C′′′ e^{−c′d}.

Proof. We will use Lemmas D.14 and D.23 to bound the transfer coefficients, so let us first verify the hypotheses of these lemmas. In our setting, the transfer matrices differ from the 'nominal' transfer matrices only by restriction to the stable features at x̄; we have S_ℓ(x̄) ∩ I_ℓ(x̄) = [n] \ R_ℓ(x̄), which is an admissible support random variable for Lemmas D.14 and D.23, and in particular

‖ ( P_{S_ℓ(x̄)} P_{I_ℓ(x̄)} − P_{I_ℓ(x̄)} ) ρ_ℓ(x̄) ‖_2 = ‖ P_{R_ℓ(x̄)} α_ℓ(x̄) ‖_2 ≤ √n ∆

by Lemma G.10 and the definition of R_ℓ(x̄). Additionally, using Lemma D.5, if d ≥ 1, n ≥ KdL, and ∆ ≤ cd/n^{5/2}, there is an event E with measure at least 1 − 2e^{−d} − L^2 e^{−cn/L} on which there are no more than d risky features at x̄. Worsening constants in the scalings of n if necessary and requiring moreover d ≥ K log n and n ≥ K′ d^4, it follows that we can invoke Lemmas D.14 and D.23 to bound the probability of events involving transfer coefficients multiplied by 1_E.
Let us also check the residuals we will obtain when applying Lemma D.23: in the notation there, the vector d̄ has as its ℓ-th entry |R_ℓ(x̄)|, and so we have the bounds ‖d̄‖_{1/2} ≤ ‖d̄‖_1^2 and ‖d̄‖_1 = Σ_ℓ |R_ℓ(x̄)|, which means that on the event E, the residual is dominated by the C √(d^4 n L) term in the scalings we have assumed. The ℓ^2→ℓ^∞ operator norm of a matrix is the maximum ℓ^2 norm of a row of the matrix, and the ℓ^∞→ℓ^∞ operator norm is the maximum ℓ^1 norm of a row. Thus

‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} = max_{i=1,…,n} ‖ e_i^* Φ_{ℓ,0}^{x̄} ‖_2 = max_{i=1,…,n} ‖ (W_ℓ)_i^* T_{ℓ−1,0}^{x̄} ‖_2,

where (W_ℓ)_i^* is the i-th row of W_ℓ, which is n_0-dimensional when ℓ = 1 and n-dimensional otherwise. In particular, we have ‖ Φ_{1,0}^{x̄} ‖_{ℓ^2→ℓ^∞} = max_{i=1,…,n} ‖ (W_1)_i ‖_2, and so taking a square root and applying Lemma G.2 and independence of the rows of W_1, we have

P[ ‖ Φ_{1,0}^{x̄} ‖_{ℓ^2→ℓ^∞} ≤ 2 (1 + √(n_0/n)) ] ≥ 1 − 2n e^{−cn},

for c > 0 an absolute constant. When ℓ > 1, we can write

max_{i=1,…,n} ‖ (W_ℓ)_i^* T_{ℓ−1,0}^{x̄} ‖_2 = max_{i=1,…,n} ‖ (W_ℓ)_i^* T_{ℓ−1,1}^{x̄} P_{I_1(x̄)} W_1 ‖_2 ≤ ‖W_1‖ max_{i=1,…,n} ‖ (W_ℓ)_i^* T_{ℓ−1,1}^{x̄} P_{I_1(x̄)} ‖_2,

where the second inequality applies Cauchy–Schwarz. Using, say, rotational invariance, Gauss–Lipschitz concentration, and Lemma E.48 (or (Vershynin, 2018, Theorem 4.4.5)), we have P[ ‖W_1‖ > C (1 + √(n_0/n)) ] ≤ 2 e^{−cn} for absolute constants c, C > 0. On the other hand, note that ‖ (W_ℓ)_i^* T_{ℓ−1,1}^{x̄} P_{I_1(x̄)} ‖_2 has the same distribution as the square root of one of the index-0 diagonal terms studied in Lemma D.23 in a network truncated at level ℓ − 1 instead of L and scaled by 2/n; and so applying this result together with a union bound and the choice n ≥ max{K, K′ d^4 L} for absolute constants K, K′ > 0 gives

P[ 1_E max_{i=1,…,n} ‖ (W_ℓ)_i^* T_{ℓ−1,1}^{x̄} P_{I_1(x̄)} ‖_2 > C ] ≤ C′ n e^{−c′d},

where C, c′, C′ > 0 are absolute constants. We conclude by another union bound

P[ ‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} ≤ C (1 + √(n_0/n)) ] ≥ 1 − 2e^{−cn} − C′ n e^{−c′d} − C′′ L^2 e^{−c′′n/L}.

We can reduce the study of the partial risky propagation coefficients to a similar calculation.
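The two operator-norm identities used here (the ℓ^2→ℓ^∞ norm is the maximum row ℓ^2 norm; the ℓ^∞→ℓ^∞ norm is the maximum row ℓ^1 norm) are easy to verify numerically, including the extremal inputs that achieve them (a generic numpy sketch, not tied to the paper's matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 7))

# ell^2 -> ell^infty norm: maximum ell^2 norm of a row.
two_to_inf = max(np.linalg.norm(A[i]) for i in range(A.shape[0]))
# ell^infty -> ell^infty norm: maximum ell^1 norm of a row.
inf_to_inf = max(np.abs(A[i]).sum() for i in range(A.shape[0]))

# Extremal inputs: the normalized worst row achieves the first norm,
# and the sign pattern of the worst row achieves the second.
i2 = int(np.argmax([np.linalg.norm(A[i]) for i in range(A.shape[0])]))
v = A[i2] / np.linalg.norm(A[i2])   # unit ell^2 norm input
i1 = int(np.argmax([np.abs(A[i]).sum() for i in range(A.shape[0])]))
s = np.sign(A[i1])                  # unit ell^infty norm input

assert abs(np.linalg.norm(A @ v, ord=np.inf) - two_to_inf) < 1e-9
assert abs(np.linalg.norm(A @ s, ord=np.inf) - inf_to_inf) < 1e-9
```

Both checks hold exactly by Cauchy–Schwarz and Hölder: the worst row saturates its own bound while every other row is dominated.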
We have ‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} = max_{j=1,…,n} ‖ (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} |_S ‖_1, where by construction ℓ > ℓ′. In the case ℓ = ℓ′ + 1, the form is slightly different; we have

‖ Φ_{ℓ′+1,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} = max_{j=1,…,n} ‖ (W_{ℓ′+1})_j |_S ‖_1 ≤ |S| max_{j=1,…,n} ‖ (W_{ℓ′+1})_j |_S ‖_∞,

where the inequality uses Lemma G.10. The classical estimate for the Gaussian tail gives

P[ |(W_{ℓ′+1})_{jk}| > 2 √(d/n) ] ≤ 2 e^{−d},   (D.58)

for each k ∈ [n], so a union bound gives P[ max_j ‖ (W_{ℓ′+1})_j |_S ‖_∞ > 2 √(d/n) ] ≤ 2n e^{−d}, and we conclude

P[ ‖ Φ_{ℓ′+1,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} ≤ 2 |S| √(d/n) ] ≥ 1 − 2 e^{−d/2},

where the final bound holds if d ≥ 2 log n. Next, we assume ℓ > ℓ′ + 1. In this case, Lemma G.10 gives

‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} = max_{j=1,…,n} ‖ (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} |_S ‖_1 ≤ |S| max_{j=1,…,n} ‖ (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} |_S ‖_∞.

For the second term on the RHS of the inequality, we write

max_j ‖ (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} |_S ‖_∞ = max_j max_{k∈S} | (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} e_k |,

then apply rotational invariance of the distribution of (W_ℓ)_j and F_{ℓ−1}-measurability of T_{ℓ−1,ℓ′}^{x̄} |_S to obtain

max_j max_{k∈S} | (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} e_k | =_d max_j max_{k∈S : ‖T_{ℓ−1,ℓ′}^{x̄} e_k‖_2 > 0} |g_j| ‖ T_{ℓ−1,ℓ′}^{x̄} e_k ‖_2,

where g ∼ N(0, (2/n)I) is independent of everything else in the problem. We have by Lemma D.14, based on our previous choices of n and d,

P[ 1_E ‖ T_{ℓ−1,ℓ′}^{x̄} e_k ‖_2 ≤ C ] ≥ 1 − C′ e^{−cn/L},

and combining this with the Gaussian tail estimate and union bounds over j and k ∈ S yields

P[ max_j ‖ (W_ℓ)_j^* T_{ℓ−1,ℓ′}^{x̄} |_S ‖_∞ > C √(d/n) ] ≤ 2n e^{−d} + C′ n e^{−cn/L} + 2 e^{−d} + L^2 e^{−c′n/L},

and thus

P[ ‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} > C |S| √(d/n) ] ≤ C′′ e^{−d/2} + C′′′ n^2 e^{−cn/L},

where the last bound holds if d ≥ 2 log n. Choosing n ≥ K L log n for a suitable absolute constant K > 0 allows us to simplify the residual terms in the probability bounds to the forms we have claimed.

Lemma D.7. There is an absolute constant c > 0 and absolute constants k, k′, K, K′ > 0 such that for any d ≥ K log n, if n ≥ K′ max{d^4 L, 1}, ∆ ≤ k n^{−5/2}, and ε ≤ k′ ∆ (1 + √(n_0/n))^{−1}, then one has for any x̄ ∈ N_ε

P[ {SSC(L) holds at x̄} ∩ { Σ_{ℓ=1}^{L} |R_ℓ(x̄, ∆)| ≤ d } ] ≥ 1 − e^{−cd}.

Proof.
We start by constructing a high-probability event on which we have control of every possible propagation coefficient. For any d ≥ K log n, choosing ∆ ≤ c n^{−5/2} and n ≥ K′ d^4 L and applying the first conclusion in Lemma D.6 and a union bound, we have

P[ ∃ℓ ∈ [L] : ‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} > C_1 (1 + √(n_0/n)) ] ≤ C e^{−cd},   (D.59)

and under the same hypotheses, for any 1 ≤ ℓ′ < ℓ ≤ L and any S ⊂ [n], the second conclusion in Lemma D.6 gives

P[ ‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} > C_2 |S| √(d/n) ] ≤ C′ e^{−16d}.

Using Lemma D.5, we have if n ≥ max{KdL, 4} and ∆ ≤ K′′/n^{5/2}

P[ Σ_{ℓ=1}^{L} |R_ℓ(x̄, ∆)| > d ] ≤ 2 e^{−d} + L^2 e^{−cn/L}.   (D.60)

Denote the complement of the event in the previous bound by E. On E, there are no more than d levels in the network with risky features, and there are at most Σ_{i=0}^{d} (n choose i) ≤ 4d n^{2d} ways to choose a subset S ⊂ [n] with |S| ≤ d. In addition, there are at most L^2 ways to pick two indices 1 ≤ ℓ′ < ℓ ≤ L. Using n ≥ L, this yields at most 4d n^{2+2d} ≤ n^{8d} items to union bound over, i.e.,

P[ ∪_{S⊂[n], |S|≤d} ∪_{1≤ℓ′<ℓ≤L} { ‖ Φ_{ℓ,ℓ′}^{x̄} |_S ‖_{ℓ^∞→ℓ^∞} > C_2 |S| √(d/n) } ] ≤ C e^{−8d}   (D.61)

if d ≥ K log n and n ≥ max{K′ d^4 L, n_0}. Denote by G the complement of the union of E^c and the events in the bounds (D.59) to (D.61); taking additional union bounds and worst-casing absolute constants, we have shown P[G] ≥ 1 − C e^{−cd}. If we enumerate the levels of the network that have risky features as 1 ≤ ℓ_1 < ⋯ < ℓ_r ≤ L, it follows from our previous counting argument that on G, we have the transfer coefficient bounds (for any ℓ ∈ [L] and any ℓ_i < ℓ)

‖ Φ_{ℓ,0}^{x̄} ‖_{ℓ^2→ℓ^∞} ≤ C_1 (1 + √(n_0/n)),   ‖ Φ_{ℓ,ℓ_i}^{x̄} |_{R_{ℓ_i}(x̄)} ‖_{ℓ^∞→ℓ^∞} ≤ C_2 |R_{ℓ_i}(x̄)| √(d/n).

Now we begin the induction. Let x ∈ N_ε(x̄). For SSC(1), we have from the definitions

‖ ρ_1(x) − ρ_1(x̄) ‖_∞ ≤ ε ‖ Φ_{1,0}^{x̄} ‖_{ℓ^2→ℓ^∞} ≤ C_1 (1 + √(n_0/n)) ε,

where the last inequality holds on G. So we have SSC(1) on G if ε ≤ ∆ (C_1 (1 + √(n_0/n)))^{−1}. Continuing, we suppose that we have established SSC(ℓ − 1) on G. We can therefore apply (D.57) together with our transfer coefficient bounds to get

‖ ρ_ℓ(x) − ρ_ℓ(x̄) ‖_∞ ≤ C_1 (1 + √(n_0/n)) ε + C_2 ∆ √(d/n) Σ_{i∈[r] : ℓ_i<ℓ} |R_{ℓ_i}(x̄)| ≤ C_1 (1 + √(n_0/n)) ε + C_2 ∆ √(d^3/n).
Notice that the last bound does not depend on ℓ. Thus, if we choose ε ≤ ∆ (2 C_1 (1 + √(n_0/n)))^{−1} and n ≥ 4 C_2^2 d^3, we obtain ‖ ρ_ℓ(x) − ρ_ℓ(x̄) ‖_∞ ≤ ∆; by induction, we can conclude that SSC(L) holds on G, which implies the claim. We obtain the final simplified probability bound by worsening the constant in the exponent.
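The stable sign consistency property established above can be observed directly in simulation: under a small input perturbation, the only ReLU activations that flip sign are those whose unperturbed preactivation magnitude was already tiny, i.e. the risky features. A hedged illustration (the sizes, perturbation radius ε, and threshold ∆ are arbitrary choices, and the check is probabilistic rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n, L = 20, 500, 8
eps, delta = 1e-6, 1e-4   # perturbation radius and risky threshold (illustrative)

Ws = [rng.normal(0.0, np.sqrt(2.0 / (n0 if l == 0 else n)),
                 size=(n, n0 if l == 0 else n)) for l in range(L)]

def forward(x):
    pre, a = [], x
    for W in Ws:
        r = W @ a
        pre.append(r)
        a = np.maximum(r, 0.0)
    return pre

x = rng.standard_normal(n0)
x /= np.linalg.norm(x)
u = rng.standard_normal(n0)
u -= (u @ x) * x                     # perturb tangentially, staying near the sphere
u *= eps / np.linalg.norm(u)

pre0, pre1 = forward(x), forward(x + u)

# Every sign flip between x and x + u should occur at a delta-risky
# coordinate of the unperturbed preactivations.
violations = 0
for r0, r1 in zip(pre0, pre1):
    flips = (r0 > 0) != (r1 > 0)
    violations += int(np.any(np.abs(r0[flips]) > delta))
assert violations == 0
```

With ε several orders of magnitude below ∆, the perturbation of each preactivation stays far below the risky threshold, so non-risky supports are stable, which is exactly the mechanism the induction exploits.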

D.3.3 UNIFORMIZING FORWARD FEATURES UNDER SSC

Under the SSC(L) condition, we can uniformize forward and backward features. A prerequisite of our approach, which we also used to establish SSC(L) in the previous section, is control of certain residuals that appear when a small number of supports can change off the nominal forward and backward correlations. These estimates are studied in the next section, Appendix D.3.4. In previous sections, most of our results (e.g. Lemma D.1) feature a lower bound of the type n ≥ K in their hypotheses. After uniformizing, we will discard this hypothesis using our extra assumption that n_0 ≥ 3, which gives us a lower bound on the logarithmic terms of the form log(C n n_0) that appear as lower bounds on d after uniformizing, together with the fact that our lower bounds on n always involve a polynomial in d. Thus, by adjusting absolute constants, we can achieve the same effect as previously.

Lemma D.8. There are absolute constants c, C > 0 and absolute constants K, K′ > 0 such that for any d ≥ K d_0 log(n n_0 C_M), if n ≥ K′ d^4 L, then one has

P[ ∩_{x̄ ∈ N_{n^{−3} n_0^{−1/2}}} ( {SSC(L, n^{−3} n_0^{−1/2}, C n^{−3}) holds at x̄} ∩ { Σ_{ℓ=1}^{L} |R_ℓ(x̄, C n^{−3})| ≤ d } ) ] ≥ 1 − e^{−cd}.

Proof. Following the discussion in Appendix D.3.1, if 0 < ε ≤ 1 then |N_ε| ≤ e^{d_0 log(C_M/ε)}; to apply Lemma D.7 we at least need ∆ ≤ k n^{−5/2} and ε ≤ k′ ∆ (1 + √(n_0/n))^{−1}, so it suffices to put ∆ = C n^{−3} when n is chosen larger than an absolute constant, and require

ε ≤ min{ 1, k′ C n^{−3} (1 + √(n_0/n))^{−1} }.

Fixing ε = n^{−3} n_0^{−1/2}, which again is admissible when n is sufficiently large compared to an absolute constant, for any d ≥ K d_0 log(n n_0 C_M), we choose n ≥ K′ max{1, d^4 L} and take a union bound over the net to obtain the claim (using here that C_M ≥ 1).

Lemma D.9. There is an absolute constant c > 0 and absolute constants K, K′, K′′ > 0 such that for any d ≥ K d_0 log(n n_0 C_M), if n ≥ K′ d^4 L, then one has

P[ ∩_{x ∈ M} { ∀ℓ ∈ [L], | ‖α_ℓ(x)‖_2 − 1 | ≤ 1/2 } ] ≥ 1 − e^{−cd}.

Proof. Let x̄ ∈ N_{n^{−3} n_0^{−1/2}}.
Lemma D.2 and a union bound give

P[ ∃ℓ ∈ [L] : | ‖α_ℓ(x̄)‖_2 − 1 | > 1/4 ] ≤ C L^2 e^{−cn/L} ≤ e^{−c′d}

if n ≥ KdL and d ≥ K′ log n. If additionally d ≥ K′′ d_0 log(n n_0 C_M), we obtain by the discussion in Appendix D.3.1 and another union bound

P[ ∪_{x̄ ∈ N_{n^{−3} n_0^{−1/2}}} { ∃ℓ ∈ [L] : | ‖α_ℓ(x̄)‖_2 − 1 | > 1/4 } ] ≤ e^{−cd}.

Let E denote the event studied in Lemma D.8; choose d ≥ K d_0 log(n n_0 C_M) and n sufficiently large to make the measure bound applicable. A union bound gives

P[ E^c ∪ ∪_{x̄ ∈ N_{n^{−3} n_0^{−1/2}}} { ∃ℓ ∈ [L] : | ‖α_ℓ(x̄)‖_2 − 1 | > 1/4 } ] ≤ e^{−cd}.

Let G denote the complement of the event in the previous bound. For any x ∈ M, we can find a point x̄ ∈ N_{n^{−3} n_0^{−1/2}} ∩ N_{n^{−3} n_0^{−1/2}}(x). On G, SSC(L, n^{−3} n_0^{−1/2}, C n^{−3}) holds at every point in the net N_{n^{−3} n_0^{−1/2}}, which implies that on G

∀ℓ ∈ [L], ‖ ρ_ℓ(x) − ρ_ℓ(x̄) ‖_∞ ≤ C/n^3,   (D.62)

and by the 1-Lipschitz property of [·]_+ and Lemma G.10, this also implies that on G

∀ℓ ∈ [L], ‖ α_ℓ(x) − α_ℓ(x̄) ‖_2 ≤ C/n^{5/2}.

Choosing n ≥ (4C)^{2/5}, the RHS of this bound is no larger than 1/4. We write using the triangle inequality

| ‖α_ℓ(x)‖_2 − 1 | ≤ | ‖α_ℓ(x)‖_2 − ‖α_ℓ(x̄)‖_2 | + | ‖α_ℓ(x̄)‖_2 − 1 |.

Using the triangle inequality again, the first term on the RHS is no larger than 1/4 for any ℓ on G. The second term is also no larger than 1/4 on G by control over the net. We conclude that on G

∀ℓ ∈ [L], | ‖α_ℓ(x)‖_2 − 1 | ≤ 1/2.

This implies that the event G is contained in the set ∩_{x∈M} { ∀ℓ ∈ [L], | ‖α_ℓ(x)‖_2 − 1 | ≤ 1/2 }, which is closed, by continuity of ‖·‖_2 and of the features as a function of the parameters, and is therefore also an event. The claim follows.

Lemma D.10. There are absolute constants c, C > 0 and absolute constants K, K′ > 0 such that for any d ≥ K d_0 log(n n_0 C_M), if n ≥ K′ d^4 L, then one has

P[ ∩_{(x,x′)∈M×M} { ∀ℓ ∈ [L], | ⟨α_ℓ(x), α_ℓ(x′)⟩ − cos ϕ^{(ℓ)}(∠(x,x′)) | ≤ C √(d^3/n) } ] ≥ 1 − e^{−cd}.

Proof. Let x̄, x̄′ ∈ N_{n^{−3} n_0^{−1/2}}. Lemma D.1 and a union bound give

P[ ∃ℓ ∈ [L] : | ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ − cos ϕ^{(ℓ)}(∠(x̄,x̄′)) | > C √(d^3/n) ] ≤ C′ L e^{−cd} ≤ e^{−c′d}

if d ≥ K log n, and hence, by another union bound over the net,

P[ ∪_{(x̄,x̄′) ∈ N_{n^{−3} n_0^{−1/2}}^{×2}} { ∃ℓ ∈ [L] : | ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ − cos ϕ^{(ℓ)}(∠(x̄,x̄′)) | > C √(d^3/n) } ] ≤ e^{−cd},

where with an abuse of notation we write S^{×2} to denote S × S for a set S. Let E_1 denote the event studied in Lemma D.8, and let E_2 denote the event studied in Lemma D.9; choose n sufficiently large to make the measure bounds applicable. A union bound gives

P[ E_1^c ∪ E_2^c ∪ ∪_{(x̄,x̄′) ∈ N_{n^{−3} n_0^{−1/2}}^{×2}} { ∃ℓ ∈ [L] : | ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ − cos ϕ^{(ℓ)}(∠(x̄,x̄′)) | > C √(d^3/n) } ] ≤ e^{−cd}

after adjusting constants. Let G denote the complement of the event in the previous bound. For any (x, x′) ∈ M × M, we can find a point x̄ ∈ N_{n^{−3} n_0^{−1/2}} ∩ N_{n^{−3} n_0^{−1/2}}(x) and a point x̄′ ∈ N_{n^{−3} n_0^{−1/2}} ∩ N_{n^{−3} n_0^{−1/2}}(x′). On G, SSC(L, n^{−3} n_0^{−1/2}, C n^{−3}) holds at every point in the net, which implies that on G

∀ℓ ∈ [L], ‖ ρ_ℓ(x) − ρ_ℓ(x̄) ‖_∞ ≤ C/n^3 and ‖ ρ_ℓ(x′) − ρ_ℓ(x̄′) ‖_∞ ≤ C/n^3,

and by the 1-Lipschitz property of [·]_+ and Lemma G.10, this also implies that on G

∀ℓ ∈ [L], ‖ α_ℓ(x) − α_ℓ(x̄) ‖_2 ≤ C/n^{5/2} and ‖ α_ℓ(x′) − α_ℓ(x̄′) ‖_2 ≤ C/n^{5/2}.   (D.63)

For any ℓ ∈ [L], we write using the triangle inequality

| ⟨α_ℓ(x), α_ℓ(x′)⟩ − cos ϕ^{(ℓ)}(∠(x,x′)) | ≤ | ⟨α_ℓ(x), α_ℓ(x′)⟩ − ⟨α_ℓ(x̄), α_ℓ(x′)⟩ | + | ⟨α_ℓ(x̄), α_ℓ(x′)⟩ − ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ | + | ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ − cos ϕ^{(ℓ)}(∠(x̄,x̄′)) | + | cos ϕ^{(ℓ)}(∠(x̄,x̄′)) − cos ϕ^{(ℓ)}(∠(x,x′)) |.   (D.64)

Using Cauchy–Schwarz, we have on G

| ⟨α_ℓ(x), α_ℓ(x′)⟩ − ⟨α_ℓ(x̄), α_ℓ(x′)⟩ | ≤ ‖ α_ℓ(x) − α_ℓ(x̄) ‖_2 ‖ α_ℓ(x′) ‖_2 ≤ 2C/n^{5/2},

with the same bound holding for the second term in (D.64) by an analogous argument. For the third term, we have on G

| ⟨α_ℓ(x̄), α_ℓ(x̄′)⟩ − cos ϕ^{(ℓ)}(∠(x̄,x̄′)) | ≤ C √(d^3/n).

For the last term, we use 1-Lipschitzness of cos and 1-Lipschitzness of the ϕ^{(ℓ)}, which follows from Lemma E.5 and the chain rule, to obtain

| cos ϕ^{(ℓ)}(∠(x̄,x̄′)) − cos ϕ^{(ℓ)}(∠(x,x′)) | ≤ | ∠(x̄,x̄′) − ∠(x,x′) |.
Using Lemma C.7 and several applications of the triangle inequality, we get

| ∠(x̄,x̄′) − ∠(x,x′) | ≤ √2 | ‖x̄ − x̄′‖_2 − ‖x − x′‖_2 | ≤ √2 ‖ (x − x′) − (x̄ − x̄′) ‖_2 ≤ √2 ‖x − x̄‖_2 + √2 ‖x′ − x̄′‖_2 ≤ 2√2/n^3,

and so returning to (D.64), we have shown

| ⟨α_ℓ(x), α_ℓ(x′)⟩ − cos ϕ^{(ℓ)}(∠(x,x′)) | ≤ C √(d^3/n) + C′/n^{5/2} ≤ (C + C′) √(d^3/n)

for every ℓ ∈ [L]. This implies that the event G is contained in the set

∩_{(x,x′)∈M×M} { ∀ℓ ∈ [L], | ⟨α_ℓ(x), α_ℓ(x′)⟩ − cos ϕ^{(ℓ)}(∠(x,x′)) | ≤ C √(d^3/n) },

which is closed, by continuity of the inner product and of the features as a function of the parameters, and is therefore also an event. The claim follows.

Lemma D.11. Assume n, L, d satisfy the requirements of Lemma D.10, and additionally d ≥ 1 and n ≥ K√L for some absolute constant K. Then

P[ ‖f_{θ_0}‖_{L^∞} ≤ √d ] ≥ 1 − e^{−cd},   P[ ‖ζ‖_{L^∞} ≤ √d ] ≥ 1 − e^{−cd}.

Define ζ̃(x) = −f(x) + ∫_M f_{θ_0}(x′) dμ_∞(x′). Then under the same assumptions

P[ ‖ζ − ζ̃‖_{L^∞} ≤ √( d/L^2 + d^{5/2} √(L/n) ) ] ≥ 1 − e^{−cd}

for some numerical constant c.

Proof. At some x ∈ M, we note that

f_{θ_0}(x) = W_{L+1} α_L(x) =_d g ‖α_L(x)‖_2,   (D.65)

where g is a standard normal random variable independent of the other variables in the problem. Similarly

f_{θ_0}(x) − ∫_M f_{θ_0}(x′) dμ_∞(x′) = W_{L+1} ( α_L(x) − ∫_M α_L(x′) dμ_∞(x′) ) =_d g′ ‖ α_L(x) − ∫_M α_L(x′) dμ_∞(x′) ‖_2,   (D.66)

where g′ is also standard normal. With respect to the randomness of W_{L+1}, these two objects are Gaussian processes with variances ‖α_L(x)‖_2^2 and ‖ α_L(x) − ∫_M α_L(x′) dμ_∞(x′) ‖_2^2 respectively. We next note that

‖ α_L(x) − ∫_M α_L(x′) dμ_∞(x′) ‖_2^2 = ‖ ∫_M ( α_L(x) − α_L(x′) ) dμ_∞(x′) ‖_2^2 ≤ ∫_M ‖ α_L(x) − α_L(x′) ‖_2^2 dμ_∞(x′) = ∫_M ( ‖α_L(x)‖_2^2 + ‖α_L(x′)‖_2^2 − 2 ⟨α_L(x), α_L(x′)⟩ ) dμ_∞(x′) ≤ sup_{x′∈M} ( ‖α_L(x)‖_2^2 + ‖α_L(x′)‖_2^2 − 2 ⟨α_L(x), α_L(x′)⟩ ) ≤ sup_{x′∈M} ( | ‖α_L(x)‖_2^2 + ‖α_L(x′)‖_2^2 − 2 ⟨α_L(x), α_L(x′)⟩ − (2 − 2 cos ϕ^{(L)}(ν(x,x′))) | + 2 − 2 cos ϕ^{(L)}(ν(x,x′)) ),

where the first inequality comes from an application of Jensen's inequality. Assume n, d satisfy the requirements of Lemma D.10, and denote the event defined in it by G.
On G, angles between features concentrate uniformly around a simple function of the angle evolution function ϕ, in the sense that, for all x, x′ ∈ M simultaneously,

| ‖α_L(x)‖_2^2 + ‖α_L(x′)‖_2^2 − 2 ⟨α_L(x), α_L(x′)⟩ − (2 − 2 cos ϕ^{(L)}(ν(x,x′))) | ≤ | ‖α_L(x)‖_2^2 − 1 | + | ‖α_L(x′)‖_2^2 − 1 | + 2 | ⟨α_L(x), α_L(x′)⟩ − cos ϕ^{(L)}(ν(x,x′)) | ≤ C′ √(d^3 L/n)   (D.67)

for some constant C′. From Lemma C.12, there exists a constant c_0 > 0 such that for all ν ∈ [0, π], 0 ≤ ϕ^{(L)}(ν) ≤ 1/(c_0 L). Using cos x ≥ 1 − x^2/2 and the above bound gives

1 ≥ cos ϕ^{(L)}(ν) ≥ 1 − (ϕ^{(L)}(ν))^2/2 ≥ 1 − 1/(2 c_0^2 L^2),

and thus 2 − 2 cos ϕ^{(L)}(ν(x,x′)) ≤ 1/(c_0^2 L^2). Combining the above bound with (D.67) and recalling the probability with which G holds, we have

P[ sup_{x,x′∈M} ( ‖α_L(x)‖_2^2 + ‖α_L(x′)‖_2^2 − 2 ⟨α_L(x), α_L(x′)⟩ ) > 1/(c_0^2 L^2) + C′ √(d^3 L/n) ] ≤ e^{−cd}.   (D.68)

On the same event we have P[ sup_{x∈M} ‖α_L(x)‖_2^2 > 2 ] ≤ e^{−cd}. Thus on G the variances of the Gaussian processes (D.65), (D.66) are uniformly bounded by 2 and 1/(c_0^2 L^2) + C′ √(d^3 L/n) respectively. Writing for concision in the subsequent expression

E_* = { | f_{θ_0}(x̄) − ∫_M f_{θ_0}(x′) dμ_∞(x′) | ≤ √d √( 1/L^2 + √(d^3 L/n) ) },

taking a union bound over all points on the net N_{n^{−3} n_0^{−1/2}} gives

P[ ∩_{x̄ ∈ N_{n^{−3} n_0^{−1/2}}} ( { |f_{θ_0}(x̄)| ≤ √(2d) } ∩ E_* ) ∩ G ] ≥ 1 − |N_{n^{−3} n_0^{−1/2}}| e^{−cd} ≥ 1 − e^{−c′d},   (D.69)

for some constants, since d was chosen to satisfy the conditions of Lemma D.10. In addition, we see from (D.63) that on the same event, for every x ∈ M there exists x̄ ∈ N_{n^{−3} n_0^{−1/2}} such that

| f_{θ_0}(x) − f_{θ_0}(x̄) | = | ( f_{θ_0}(x) − ∫_M f_{θ_0}(x′) dμ_∞(x′) ) − ( f_{θ_0}(x̄) − ∫_M f_{θ_0}(x′) dμ_∞(x′) ) | = | W_{L+1} ( α_L(x) − α_L(x̄) ) | ≤ ‖W_{L+1}‖_2 ‖ α_L(x) − α_L(x̄) ‖_2 ≤ C ‖W_{L+1}‖_2 / n^{5/2}.

Bernstein's inequality also gives P[ ‖W_{L+1}‖_2 > C√n ] ≤ e^{−cn} for some constants.
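The distributional identity behind (D.65) and (D.66), namely that for a fixed feature vector the output-layer map is a scalar Gaussian whose standard deviation is the feature norm (up to the layer's variance scaling), is easy to check empirically. A minimal numpy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# A fixed (ReLU-like, hence nonnegative) feature vector standing in for alpha_L(x).
alpha = np.maximum(rng.standard_normal(n), 0.0)

# For fixed alpha and a Gaussian row W ~ N(0, I_n), the scalar W @ alpha is
# exactly N(0, ||alpha||_2^2): compare the empirical standard deviation
# over many draws of W with ||alpha||_2.
samples = rng.standard_normal((20000, n)) @ alpha
rel_err = abs(samples.std() - np.linalg.norm(alpha)) / np.linalg.norm(alpha)
assert rel_err < 0.05
```

The same identity, applied to the centered feature α_L(x) − ∫ α_L dμ_∞, is what turns the variance bounds above into the uniform bounds on f_{θ_0} via Gaussian tails.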
Denoting the complement of this event by E, we have that on E ∩ G, for every x ∈ M there exists x̄ ∈ N_{n^{−3} n_0^{−1/2}} such that

| f_{θ_0}(x) − f_{θ_0}(x̄) | ≤ C/n^2 ≤ √(2d),

| ( f_{θ_0}(x) − ∫_M f_{θ_0}(x′) dμ_∞(x′) ) − ( f_{θ_0}(x̄) − ∫_M f_{θ_0}(x′) dμ_∞(x′) ) | ≤ C/n^2 ≤ √d √( 1/L^2 + √(d^3 L/n) ),

where we assumed d ≥ 1 and n ≥ K√L for some K. Combining the above bound with (D.69), taking a union bound over the complements of E and G, and worsening constants, this gives

P[ ∩_{x∈M} ( { |f_{θ_0}(x)| ≤ √d } ∩ { | f_{θ_0}(x) − ∫_M f_{θ_0}(x′) dμ_∞(x′) | ≤ √( d/L^2 + d^{5/2} √(L/n) ) } ) ] ≥ 1 − e^{−cd} − P[G^c] − P[E^c] ≥ 1 − e^{−c′d} − e^{−c′n} ≥ 1 − e^{−c′′d}.

From the first result, we also obtain that with the same probability ‖ζ‖_{L^∞} ≤ 1 + √d; by worsening the constant in the tail we can simplify this to ‖ζ‖_{L^∞} ≤ √d. Recalling that ζ̃(x) = −f(x) + ∫_M f_{θ_0}(x′) dμ_∞(x′), since ζ(x) − ζ̃(x) = f_{θ_0}(x) − ∫_M f_{θ_0}(x′) dμ_∞(x′), it follows that

P[ ‖ζ − ζ̃‖_{L^∞} ≤ √( d/L^2 + d^{5/2} √(L/n) ) ] ≥ 1 − e^{−c′d}.

Lemma D.12. For some integer d_0, assume n, L, d satisfy the requirements of Lemmas D.8 and D.14, meaning that there exist absolute constants K, K′, K′′, K′′′, K′′′′ > 0 such that d ≥ max{K d_0 log(n n_0 C_M), K′ log L} and n ≥ max{K′′ d^4 L, K′′′ L log n, K′′′′}. Then:

1. If d_0 = 1 and n ≥ K̃ max{ d^2 L, (κ/c_λ)^{1/3}, κ^{2/5} }, where K̃ is some absolute constant and κ and c_λ are the extrinsic curvature and injectivity coefficient defined in Section 2.1, then on an event of probability at least 1 − e^{−cd}, one has ‖f_{θ_0}‖_{M_±, Lip} ≤ √d for a numerical constant c, where the Lipschitz seminorm is taken with respect to the Riemannian distance on M_±.

2. If M = S^{n_0−1}, so that d_0 = n_0 − 1, then on an event of probability at least 1 − e^{−cd}, one has ‖f_{θ_0}‖_{S^{n_0−1}, Lip} ≤ √d for a numerical constant c.

Proof. We recall f_{θ_0}(x) = W_{L+1} α_L(x). Let M′ denote a connected component of M. Let x_1, x_2 ∈ M′, and fix a smooth unit-speed geodesic γ : [0, dist_{M′}(x_1, x_2)] → M′ such that γ(0) = x_1 and γ(dist_{M′}(x_1, x_2)) = x_2.
The absolute continuity of f_{θ_0}|_{M_±} ∘ γ follows from an argument almost identical to the one employed in the proof of Lemma B.8, and we obtain in particular the bound

| f_{θ_0}(x_1) − f_{θ_0}(x_2) | = | ∫_0^{dist_{M′}(x_1,x_2)} ⟨ γ′(t), W_1^* β_0(γ(t)) ⟩ dt |.

Because γ is a unit-speed geodesic, we have for all t P_{T_{γ(t)}M′} γ′(t) = (γ′(t) γ′(t)^*) γ′(t) = γ′(t), and so in particular, by the triangle inequality and Cauchy–Schwarz,

| f_{θ_0}(x_1) − f_{θ_0}(x_2) | ≤ dist_{M′}(x_1, x_2) sup_{x∈M′} ‖ P_{T_xM′} W_1^* β_0(x) ‖_2.   (D.70)

This implies ‖f_{θ_0}‖_{M′, Lip} ≤ sup_{x∈M′} ‖ P_{T_xM′} W_1^* β_0(x) ‖_2. We next write W_1 = W_1 x x^* + W_1 (I − x x^*) = G_1 + H_1. T_xM′ can be identified with a linear subspace of R^{n_0} of dimension d_0. Since it is also a subspace of T_x S^{n_0−1}, P_{T_xM′} x = 0 and hence

P_{T_xM′} W_1^* = P_{T_xM′} H_1^* =_d P_{T_xM′} (I − x x^*) W̃_1^* = P_{T_xM′} W̃_1^*,

where W̃_1 is a copy of W_1 that is independent of all the other variables in the problem (since β_0(x) depends only on G_1). We first consider the case d_0 = n_0 − 1, and subsequently the case d_0 = 1. We note that

‖ P_{T_xM′} W̃_1^* β_0(x) ‖_2^2 ≤ ‖ W̃_1^* β_0(x) ‖_2^2 = Σ_{i=1}^{n_0} ( (W̃_1)_i^* β_0(x) )^2,   (D.71)

where (W̃_1)_i denotes the i-th row of W̃_1. Considering a single summand in the above expression, repeated application of the rotational invariance of the Gaussian distribution gives

( (W̃_1)_i^* β_0(x) )^2 =_d (2/n) ‖β_0(x)‖_2^2 g_{i,I(x)}^2 = (2/n) ‖ W_{L+1} Γ_{L:2}(x) P_{I_1(x)} ‖_2^2 g_{i,I(x)}^2 ≤ (2/n) ‖ W_{L+1} Γ_{L:2}(x) ‖_2^2 g_{i,I(x)}^2 =_d (2/n) ‖ Γ_{L:2}(x) e_1 ‖_2^2 ‖ W_{L+1} ‖_2^2 g_{i,I(x)}^2,   (D.72)

where g_{i,I(x)} is a standard normal variable that depends on i and the support patterns I(x) = {I_1(x), …, I_L(x)}, since β_0(x) depends on x only through I(x). Similarly, the dependence on x in the first factor in (D.72) is only through I(x). We now show how to control such terms uniformly on M′. Define a net N_{n^{−3} n_0^{−1/2}} as in Appendix D.3.1. According to Lemma C.4, |N_{n^{−3} n_0^{−1/2}}| ≤ e^{3 log(C_M n n_0) d_0}.
Assume n, L, d satisfy the requirements of Lemma D.8 and denote the event defined in that lemma by E. We also define sets of support patterns

J(x̄) = { J = {J_1, …, J_L} : Σ_{ℓ=1}^{L} |J_ℓ △ I_ℓ(x̄)| ≤ d },   (D.73)

J(N_{n^{−3} n_0^{−1/2}}) = ∪_{x̄ ∈ N_{n^{−3} n_0^{−1/2}}} J(x̄).

On E, ∪_{x∈M′} {I(x)} ⊆ J(N_{n^{−3} n_0^{−1/2}}), and additionally for any x ∈ M′ there exists some x̄ ∈ N_{n^{−3} n_0^{−1/2}} such that x ∈ N_{n^{−3} n_0^{−1/2}}(x̄) and I(x) ∈ J(x̄). We now show that on E, the supports I(x) satisfy the requirements of Lemma D.14 with δ_s = d, K_s = C n^{−5/2}, with the anchor point in the statement of that lemma chosen to be x̄. The value of δ_s is satisfied by the supports at every point on the manifold on E from the definition of this event. From the definition of the stable sign consistency property (SSC) in Appendices D.3.1 and D.3.2, the only features whose sign can differ between x and x̄ are the risky features, and from the bound on their norm in the definition of SSC(L, n^{−3} n_0^{−1/2}, C n^{−3}) we obtain for all ℓ

‖ (P_{J_ℓ} − P_{I_ℓ(x̄)}) ρ_ℓ(x̄) ‖_∞ ≤ C/n^3  ⇒  ‖ (P_{J_ℓ} − P_{I_ℓ(x̄)}) ρ_ℓ(x̄) ‖_2 ≤ C/n^{5/2},

where in the last inequality we used Lemma G.10. It follows that if n ≥ KLd for some K, the requirements of Lemma D.14 are satisfied if we make the choice E = E_{δK}. We would next like to apply Lemma D.14 for every possible support pattern in J(N_{n^{−3} n_0^{−1/2}}), which requires that we first bound the cardinality of this set. Note that J(x̄) is the set of support patterns that differ from I(x̄) in at most d of the nL coordinates. Thus

|J(x̄)| ≤ Σ_{i=0}^{d} (Ln choose i) ≤ d (eLn/d)^d ≤ (Ln)^{Cd}

for some C. Using the bound on the cardinality of the net from Lemma C.4, the size of this set can be bounded, giving

|J(N_{n^{−3} n_0^{−1/2}})| ≤ |N_{n^{−3} n_0^{−1/2}}| max_{x̄} |J(x̄)| ≤ e^{3 log(C_M n n_0) d_0 + C log(Ln) d}.   (D.74)

We can now apply Lemma D.14 with E = E_{δK} to bound ‖Γ_{ℓ:2}(x) e_1‖_2^2 for all ℓ on E, taking a union bound over all possible supports. Bernstein's inequality and an exponential tail bound can be used to bound the second and third factors in (D.72), respectively.
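The support-pattern counting step rests on the standard binomial-sum estimate Σ_{i≤d} C(Ln, i) ≤ d(eLn/d)^d. A quick check with Python's exact integer arithmetic (the values of L, n, d below are arbitrary illustrative choices):

```python
import math

L, n, d = 8, 400, 6
N = L * n  # number of (level, index) coordinates a support pattern can differ in

# Exact count of support patterns differing in at most d coordinates,
# versus the closed-form upper bound used in (D.74).
lhs = sum(math.comb(N, i) for i in range(d + 1))
rhs = d * (math.e * N / d) ** d
assert lhs <= rhs
print(lhs, rhs)
```

The bound is loose by roughly the factor d · (e)^d, which is harmless here since it only enters (D.74) through a constant in the exponent.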
Using (D.74) to bound the number of supports we need to uniformize over (since on E, ∪_{x∈M′} {I(x)} ⊆ J(N_{n^{−3} n_0^{−1/2}})) and the bound on the size of the net in Lemma C.4 and Appendix D.3.1, we obtain

P[ ∀x ∈ M′ : (2/n) 1_E ‖Γ_{L:2}(x) e_1‖_2^2 ‖W_{L+1}‖_2^2 g_{i,I(x)}^2 ≤ d ] = P[ ∀x̄ ∈ N_{n^{−3} n_0^{−1/2}}, ∀x ∈ N_{n^{−3} n_0^{−1/2}}(x̄) : (2/n) 1_E ‖Γ_{L:2}(x) e_1‖_2^2 ‖W_{L+1}‖_2^2 g_{i,I(x)}^2 ≤ d ] ≥ 1 − |N_{n^{−3} n_0^{−1/2}}| e^{C log(n) d} e^{−c′n/L} − |N_{n^{−3} n_0^{−1/2}}| e^{C log(n) d} e^{−c′′d} − e^{−c′′′n} ≥ 1 − e^{3 log(C_M n n_0) d_0 + C log(n) d − c′n/L} − e^{3 log(C_M n n_0) d_0 + C log(n) d − c′′d} − e^{−c′′′n} ≥ 1 − e^{−c d},   (D.75)

where we assume d ≥ K log(C_M n n_0) d_0 and n ≥ K′ L d^2 for some K, K′. Since according to Lemma D.8 the event E holds with probability at least 1 − e^{−cd} for some c, we can remove the indicator in the bound above by assuming d ≥ K′′ for some absolute constant K′′ and worsening the constant in the bound. We can now complete the proof for d_0 = n_0 − 1. Since we are interested in bounding the sum (D.71) uniformly, we can bound Σ_{i=1}^{n_0} g_{i,I(x)}^2 as well using Bernstein's inequality, and uniformizing as above we obtain

P[ ∀x ∈ M′ : (2/n) 1_E ‖Γ_{L:2}(x) e_1‖_2^2 ‖W_{L+1}‖_2^2 Σ_{i=1}^{n_0} g_{i,I(x)}^2 ≤ C(n_0 + d) ] ≥ 1 − e^{−c d}

by the identical chain of union bounds over the net and the support patterns, where we again assume d ≥ K log(C_M n n_0) d_0 and n ≥ K′ L d^2 for some K, K′. As above, we can remove the indicator in the bound by assuming d ≥ K′′ for some absolute constant K′′ and worsening the constant in the bound. Worsening constants in the failure probability, we can replace the residual C(n_0 + d) in the above expression by d. Using (D.70), we obtain that with the same probability the Lipschitz constant of f_{θ_0} on S^{n_0−1} is bounded by √d. We now consider d_0 = 1.
Recall that for any x ∈ M′ there exists some x̄ ∈ N_{n^{−3} n_0^{−1/2}} such that x ∈ N_{n^{−3} n_0^{−1/2}}(x̄), where N_{n^{−3} n_0^{−1/2}} is the net defined earlier. The gradient vector at x takes the form

P_{T_xM′} W̃_1^* β_0(x) = P_{T_{x̄}M′} W̃_1^* β_0(x) + ( P_{T_xM′} − P_{T_{x̄}M′} ) W̃_1^* β_0(x).   (D.76)

We proceed to bound the first term in the above equation uniformly over M′. Since in the d_0 = 1 case T_{x̄}M′ can be identified with a linear subspace of R^{n_0} of dimension 1, we can write the projection operator as

P_{T_{x̄}M′} = v_{x̄} v_{x̄}^*   (D.77)

for some unit-norm v_{x̄}. We then obtain from rotational invariance of the Gaussian distribution that

‖ P_{T_{x̄}M′} W̃_1^* β_0(x) ‖_2^2 = ‖ v_{x̄} v_{x̄}^* W̃_1^* β_0(x) ‖_2^2 =_d (2/n) ‖β_0(x)‖_2^2 g_{x̄}^2 ≤ (2/n) ‖Γ_{L:2}(x) e_1‖_2^2 ‖W_{L+1}‖_2^2 g_{x̄}^2,

where g_{x̄} is a standard normal variable and the last bound comes from a calculation identical to (D.72). Under the same assumptions on d, L, n as before, we can bound this expression uniformly using (D.75), additionally taking a union bound over the net to account for all possible choices of x̄. This gives

P[ ∀x̄ ∈ N_{n^{−3} n_0^{−1/2}}, ∀x ∈ N_{n^{−3} n_0^{−1/2}}(x̄) : (2/n) 1_E ‖Γ_{L:2}(x) e_1‖_2^2 ‖W_{L+1}‖_2^2 g_{x̄}^2 ≤ d ] ≥ 1 − |N_{n^{−3} n_0^{−1/2}}| e^{−c′d} ≥ 1 − e^{3 log(C_M n n_0) d_0 − c′d} ≥ 1 − e^{−c′′d},

where we assume d ≥ K log(C_M n n_0) for some K. This gives control of the first term in (D.76) uniformly on M′. We now turn to controlling the second term in (D.76). For some x, choose x̄ ∈ N_{n^{−3} n_0^{−1/2}} such that x ∈ N_{n^{−3} n_0^{−1/2}}(x̄). Define a unit-speed curve γ : [0, s̄] → M′ such that γ(0) = x̄, γ(s̄) = x. Since the curvature of M′ is bounded by κ, we have ∀s′ ∈ [0, s̄] : ‖γ′′(s′)‖_2 ≤ κ. Denote by r the geodesic distance between x and x̄. Since the Euclidean distance between them is bounded by n^{−3} n_0^{−1/2}, assuming n > K for some K implies that r < C′ n^{−3} n_0^{−1/2} for some C′. If we now demand n^3 ≥ C′ κ/c_λ, which implies C′ n^{−3} n_0^{−1/2} ≤ c_λ/κ, then (A.2) gives s̄ ≤ C′′ n^{−3} n_0^{−1/2} for some C′′ > 0. For v_x, v_{x̄} defined as in (D.77), we have γ′(0) = v_{x̄}, γ′(s̄) = v_x.
Combining the previous two results, it follows that

‖ v_x − v_{x̄} ‖_2 = ‖ γ′(s̄) − γ′(0) ‖_2 = ‖ ∫_0^{s̄} γ′′(s′) ds′ ‖_2 ≤ s̄ κ ≤ C κ n^{−3} n_0^{−1/2}.

A straightforward calculation then gives

‖ P_{T_xM′} − P_{T_{x̄}M′} ‖ = ‖ v_x v_x^* − v_{x̄} v_{x̄}^* ‖ ≤ (1/2) ‖ v_x − v_{x̄} ‖_2 ‖ v_x + v_{x̄} ‖_2 ≤ ‖ v_x − v_{x̄} ‖_2 ≤ C κ n^{−3} n_0^{−1/2}.

If we now use Lemma D.13 to control the norms of the backward features uniformly, a standard bound on the norm of a Gaussian matrix to give P[ ‖W_1‖ > C(1 + √(n_0/n)) ] ≤ e^{−cn}, and assume n ≥ κ^{2/5}, we obtain that

P[ ∀x̄ ∈ N_{n^{−3} n_0^{−1/2}}, ∀x ∈ N_{n^{−3} n_0^{−1/2}}(x̄) : ‖ ( P_{T_xM′} − P_{T_{x̄}M′} ) W̃_1^* β_0(x) ‖ ≤ C ] ≥ 1 − e^{−cd} − e^{−c′n} ≥ 1 − e^{−c′′d}.

Combining the above with (D.75), using (D.76), and taking a union bound over the failure probability of E (which results in a worsening of constants) completes the proof. We can additionally rescale d to obtain a final bound on the Lipschitz constant of √d instead of C√d, which also results in a worsening of constants.

Lemma D.13. There are absolute constants c, C > 0 and absolute constants K, K′ > 0 such that for any d ≥ K d_0 log(n n_0 C_M), if n ≥ K′ d^4 L, then there exists an event E such that:

1. On E, we have

∀ℓ ∈ [L], | ⟨β_{ℓ−1}(x), β_{ℓ−1}(x′)⟩ − (n/2) Π_{ℓ′=ℓ−1}^{L−1} ( 1 − ϕ^{(ℓ′)}(∠(x,x′))/π ) | ≤ C √(d^4 n L)

simultaneously for every (x, x′) ∈ M × M;

2. P[E] ≥ 1 − e^{−cd}.

Proof. Let E_1 denote the event studied in Lemma D.8, with C_0 denoting the absolute constant appearing in the SSC(L) condition there; choose d ≥ K d_0 log(n n_0 C_M) and n sufficiently large to make the measure bound applicable. We will need to apply Lemma D.23 together with a derandomization argument to prove the claim; we appeal to the same residual checks at the beginning of the proof of Lemma D.6 to see that on E_1, the dominating residual in Lemma D.23 under the scalings of d and n we enforce here is of size C √(d^4 n L).
For any subset S ⊂ [L] × [n], we write S = {i ∈ [n] | ( , i) ∈ S}, and we define S(S) = {-1, +1} |S1| × • • • × {-1, +1} |S L | for the set of "lists" of sign patterns with sizes adapted to these projections of S, with the convention {-1, +1} 0 = {0}. If Σ = {σ 1 , . . . , σ L } ∈ S(S) is such a list of sign vectors and ∆ ≥ 0, we define Ĩ (x, S, Σ, ∆) = supp 1 ρ (x)> i∈S ((σ )i∆)ei , which is a sort of two-sided robust analogue of the support of α (x): notice that when S = ∅ we have Ĩ (x, S, Σ, ∆) = I (x). We also define for = 0, 1, . . . , L -1 β S,Σ,∆ (x) = W L+1 P ĨL (x,S,Σ,∆) W L P ĨL-1 (x,S,Σ,∆) . . . W +2 P Ĩ +1 (x,S,Σ,∆) * , a generalized backward feature induced by these robust support patterns. Writing for concision S x, x ,S,S ,Σ,Σ = ∃ ∈ [L] : β -1 S,Σ,C0n -3 ( x), β -1 S ,Σ ,C0n -3 ( x ) -n 2 L-1 = -1 1 -ϕ ( ) (∠( x, x )) π > C 1 d 4 nL log 4 n , where C 1 > 0 is an absolute constant we will specify below to make the event hold with high probability, we then define the event E 2 = x∈N n -3 x ∈N n -3 S⊂[L]×[n] S ⊂[L]×[n] |S|≤d,|S |≤d Σ∈S(S) Σ ∈S(S ) S x, x ,S,S ,Σ,Σ . There are no more than d k=0 nL k ≤ n 4d ways to choose the subset S in this union, and for a fixed S there are no more than 2 d ways to choose the sign pattern Σ. Thus, there are no more than exp(10d log n+12d 0 log(nn 0 C M )) elements in the union, and under the condition on d this number is no larger than n 11d . For concision, write ξ ( x, x ) = n 2 L-1 = 1 - ϕ ( ) (∠( x, x )) π . For any instantiation of these parameters, Lemma D.23 and a union bound give P ∃ ∈ [L] : β -1 S,Σ,C0n -3 ( x), β -1 S ,Σ ,C0n -3 ( x ) -ξ -1 ( x, x ) > C √ d 4 nL ≤ P[E c 1 ] + P ∃ ∈ [L] : 1 E1 β -1 S,Σ,C0n -3 ( x), β -1 S ,Σ ,C0n -3 ( x ) -ξ -1 ( x, x ) > C √ d 4 nL ≤ e -cd for any d ≥ K log n and n ≥ K d 4 L.
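The counting step above — at most Σ_{k≤d} C(nL, k) ≤ n^{4d} subsets together with at most 2^d sign patterns per subset — can be sanity-checked for moderate parameter values (an illustration with arbitrary small sizes, not the regime of the theorem):

```python
from math import comb

def union_size(n, L, d):
    # Number of (subset, sign-pattern) pairs: subsets S of [L] x [n] with
    # |S| <= d, each carrying one of at most 2^d sign assignments.
    subsets = sum(comb(n * L, k) for k in range(d + 1))
    return subsets * 2 ** d

for n, L, d in [(16, 4, 3), (64, 8, 5), (256, 16, 8)]:
    assert union_size(n, L, d) <= n ** (4 * d)
```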
Thus, if we set C 1 = C and enforce d ≥ Kd 0 log(nn 0 C M )/ log n and n ≥ max K d 4 L log 4 n, we have by a union bound P[E 1 ∪ E 2 ] ≤ n -cd . Let G = E c 1 ∩ E c 2 . For any (x, x ) ∈ M × M, we can find a point x ∈ N n -3 n -1/2 0 ∩ N n -3 n -1/2 0 (x) and a point x ∈ N n -3 n -1/2 0 ∩ N n -3 n -1/2 0 (x ). On G, SSC(L, n -3 n -1/2 0 , Cn -3 ) holds at every point in the net N n -3 n -1/2 0 , and there are no more than d Cn -3 -risky features at any point in the net N n -3 n -1/2 0 , and in addition, following (D.52), we have almost surely on G that all risky features are realized for magnitudes in (-∆, +∆). This implies that on G, the support sets ∈[L] I (x) at any point x ∈ N n -3 n -1/2 0 ( x) differ by the support sets ∈[L] I ( x) at the base point in the net by no more than d entries, consisting only of a subset of the risky features at x; the analogous statement is of course true for x and x . At the same time, notice that on the event E c 2 we have constructed, we have control of every possible backward feature inner product obtained by modifying the supports at the base points x, x at no more than d risky features (each), since, for example, if (ρ ( x)) i is risky, then 1 (ρ ( x))i>∆ corresponds to "turning off" the feature, and 1 (ρ ( x))i>-∆ corresponds to "turning on" the feature. Formally, we have established that on G ∀ ∈ [L], β -1 (x), β -1 (x ) - n 2 L-1 = -1 1 - ϕ ( ) (∠( x, x )) π ≤ C 1 d 4 nL log 4 n. We can use differentiability properties for the remaining link: following the proof of Lemma D.10, we have |∠( x, x ) -∠(x, x )| ≤ √ 2 x -x 2 + √ 2 x -x 2 ≤ 2 √ 2 n 3 , so we just need a Lipschitz property for the function q(ν) = (n/2) L-1 = (1 -π -1 ϕ ( ) (ν)) . For this we appeal to Lemma E.5, which shows that the function ϕ is smooth, increasing and concave; therefore by the chain rule, the functions ϕ ( ) are increasing and concave, and by the Leibniz rule, q is decreasing and convex. 
It therefore suffices to calculate q (0); this is done in Lemma C.18, which gives q (0) = -n(L -)/(2π), and in particular |q (0)| ≤ cnL. It follows n 2 L-1 = 1 - ϕ ( ) (∠( x, x )) π - n 2 L-1 = 1 - ϕ ( ) (∠(x, x )) π ≤ cL n 2 , so that by the triangle inequality ∀ ∈ [L], β -1 (x), β -1 (x ) - n 2 L-1 = -1 1 - ϕ ( ) (∠(x, x )) π ≤ 2C 1 d 4 nL log 4 n, where the residual simplification is valid when n ≥ KL. We conclude that the set (x,x )∈M×M ∀ ∈ [L], β -1 (x), β -1 (x ) -ξ -1 ( x, x ) ≤ 2C 1 d 4 nL log 4 n contains the event G, which satisfies the claimed properties and completes the proof (after rescaling d by 1/ log n, which updates the lower bound on d).
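The final Lipschitz step rests on a general fact: a differentiable, decreasing, convex function on [0, π] has its steepest slope at 0, so it is |q'(0)|-Lipschitz on the whole interval. A minimal check with a stand-in function (not the paper's q, which is built from the angle evolution functions ϕ^(ℓ)):

```python
import numpy as np

def qf(t):
    # Stand-in decreasing convex function on [0, pi]; q'(0) = -1.
    return np.exp(-t)

qprime0 = 1.0  # |q'(0)| for qf

rng = np.random.default_rng(1)
a, b = rng.uniform(0.0, np.pi, size=(2, 10000))
# Decreasing + convex => the slope is largest in magnitude at 0, so
# |q(a) - q(b)| <= |q'(0)| * |a - b| everywhere on the interval.
assert np.all(np.abs(qf(a) - qf(b)) <= qprime0 * np.abs(a - b) + 1e-12)
```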

D.3.4 SMALL SUPPORT CHANGE RESIDUALS

In this section, we prove generalized versions of our pointwise concentration lemmas for backward feature correlations and the matrices defining the propagation coefficients used in our study of SSC(L). Lemma D.14. Assume n ≥ max {KL log n, K Ld, K }, d ≥ K log L for suitably chosen K, K , K , K and integer L, and choose 1 ≤ ≤ ≤ L. Define an anchor point x ∈ M and denote I i (x) = supp α i 0 (x) > 0 for ≤ i ≤ . Choose some δ s , K s > 0 and let J = {J , . . . , J } denote a collection of support sets such that each J i ⊂ [n] depends on the network parameters only through the pre-activation ρ i 0 (x). We define events implying that the supports at J are close to those at x: E δ = ≤i≤ {|J i I i (x)| ≤ δ s } , E K = ≤i≤ P Ji -P Ii(x) ρ i (x) 2 ≤ K s , E δK = E δ ∩ E K . Define Γ : J = P J W P J -1 . . . P J W , and fix a unit norm vector v f . If K s ≤ 1 2 L -3/2 , δ s ≤ n L , then P 1 E δK Γ : J v f 2 ≤ C ≥ 1 -e -c n L , and P 1 E δK Γ : J ≤ C √ L ≥ 1 -e -c n L . For a vector g, g i ∼ iid N (0, 1), defining H i = W i I -α i-1 (x)α i-1 (x) * for i ∈ [L] and Γ : HJ = P J H P J -1 . . . P J H we have P 1 E δK Γ : J -Γ : HJ g 2 > C √ dL ≤ e -cd for some numerical constants c, C. Proof. In the following, we will denote by v f ∈ S n-1 a fixed unit norm vector and by v u ∈ S n-1 a random vector uniformly distributed on S n-1 . When there is no need to distinguish between the two we will denote either by v p . Our strategy in bounding 1 E δK Γ : J will be first to bound 1 E δK Γ : J v f 2 with sufficiently high probability, and then apply an ε-net argument to uniformize the result (lemma D.20) and get control of the operator norm. In achieving the first goal, we will rely heavily on a decomposition of the weight matrices into terms that are conditionally independent given the pre-activations. We will also utilize martingale concentration to control the terms that result from this decomposition. 
Denoting S i = span α i (x) for i ∈ [L], we decompose the weight matrices into W i = W i P S i-1 + W i P S i-1⊥ . = G i + H i . Note that H 1 , . . . , H L are conditionally independent given σ(G 1 , . . . , G L ) (by which we denote the sigma algebra generated by G 1 , . . . , G L ). Since the pre-activations obey ρ i (x) = W i α i-1 (x) = G i α i-1 (x) and the features are deterministic functions of the pre-activations, both α 1 (x), . . . , α L (x) and ρ 1 (x), . . . , ρ L (x) are measurable with respect to σ(G 1 , . . . , G L ). We define events E ρ = i= ρ i (x) 2 ≤ C , E δKρ = E δK ∩ E ρ (D.78) and aim to control 1 E δKρ Γ : J (x) . Since the supports J depend on the weights only through the pre-activations and are thus also σ(G 1 , . . . , G L )-measurable, this truncation does not affect the conditional independence of H , . . . , H . It will often be convenient to utilize the rotational invariance of the Gaussian distribution to replace all occurrences of H i in a given expression by W i P S i-1⊥ where W i is a fresh copy of W i independent of all the other variables in the problem, which will not change the distribution of the original expression. For ≤ i ≤ , ≤ j ≤ i + 1 it will also be useful to denote Γ i:j HJ = P Ji H i P Ji-1 . . . P Jj H j , Γ i:j GJ = P Ji G i P Ji-1 . . . P Jj G j where we use the convention Γ i:i+1 GJ = Γ i:i+1 HJ = I. Decomposing the weight matrices at every layer gives Γ : J v p 2 = P J G + H . . . P J G + H v p 2 ≤ (M ,...,M )∈(G ,H )⊗,...,⊗(G ,H ) P J M . . . P J M v p 2 . (D.79) We next define Q i (x) = P Ji -P Ii(x) . (D.80) In accounting for all the terms in the decomposition (D.79), there will be two simplifications that we use repeatedly.
One is H i+1 P Ji ρ i (x) = W i+1 I - α i (x)α i (x) * α i (x) 2 2 P α i (x)>0 + Q i (x) ρ i (x) = H i+1 Q i (x)ρ i (x) (D.81) where we used P α i (x)>0 ρ i (x) = ρ i (x) + = α i (x), from which it follows that H i+1 P Ji G i =W i+1 I - α i (x)α i (x) * α i (x) 2 2 P α i (x)>0 + Q i (x) ρ i (x)α i-1 (x) * α i-1 (x) 2 2 =H i+1 Q i (x)G i . (D.82) We also have G i+1 P Ji G i =G i+1 P Ii(x) + Q i (x) G i =W i+1 α i (x)α i (x) * α i (x) 2 2 P Ii(x) + Q i (x) W i α i-1 (x)α i-1 (x) * α i-1 (x) 2 2 = α i (x) 2 α i-1 (x) 2 1 + α i (x) * α i (x) 2 2 Q i (x)W i α i-1 (x) W i+1 α i (x)α i-1 (x) * α i (x) 2 α i-1 (x) 2 . =s i W i+1 α i (x)α i-1 (x) * α i (x) 2 α i-1 (x) 2 , and thus Γ i:j GJ = i-1 k=j s k P Ji W i α i-1 (x)α j-1 (x) * α i-1 (x) 2 α j-1 (x) 2 = i-1 k=j s k P Ji ρ i (x)α j-1 (x) * α i-1 (x) 2 α j-1 (x) 2 . (D.83) We refer to such a product as a G-chain. We proceed to expand (D.79) into terms with different combinations of matrices Γ i:j GJ and Γ i:j HJ . There will be 2 -terms in total, and we denote the set of terms with r G-chains by G r,p (with the subscript p ∈ {u, f } denoting the choice of vector v p ). We can clearly label each term by the start and end index of each G-chain, which may not be distinct. We denote each such term by g rp (i1,i2,...,i2r) where ≤ i 1 ≤ i 2 ≤ i 3 -2 ≤ i 4 -2 < i 5 -4 ≤ • • • ≤ i 2m-1 -2m + 2 ≤ i 2m -2m + 2 ≤ . . . ≤ i 2r-1 -2r + 2 ≤ i 2r -2r + 2 ≤ -2r + 2. (D.84) The constraints above ensure that every two G-chains are separated by at least one H i matrix. To lighten notation, we denote a set of indices obeying the constraints by (i 1 , . . . , i 2r ) ∈ C r ( , ). The maximal number of G-chains possible is bounded by r ≤ ( -) /2 . 
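The utility of the G/H decomposition is that for a matrix with i.i.d. Gaussian entries, W P_S and W P_{S⊥} are jointly Gaussian with zero cross-covariance, hence independent; this is what makes H 1 , . . . , H L conditionally independent given σ(G 1 , . . . , G L ). A toy Monte Carlo check in two dimensions (illustrative, not the paper's scaling):

```python
import numpy as np

rng = np.random.default_rng(2)

# W has iid N(0, 1) entries and S = span(alpha).  Then G = W P_S and
# H = W P_{S^perp} are jointly Gaussian with zero cross-covariance, hence
# independent.  (Toy 2-D check; the paper uses N(0, 2/n) entries.)
alpha = np.array([3.0, 4.0]) / 5.0
P_S = np.outer(alpha, alpha)
P_Sperp = np.eye(2) - P_S

W = rng.standard_normal((200_000, 2))   # each row: one sample of a row of W
G_col = W @ P_S[:, 0]                   # entry (1,1) of W P_S per sample
H_col = W @ P_Sperp[:, 0]               # entry (1,1) of W P_{S^perp} per sample

corr = np.corrcoef(G_col, H_col)[0, 1]
assert abs(corr) < 0.02                 # population correlation is exactly 0
```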
Since the g rp (i1,i2,...,i2r) are non-negative, we have 1 E δKρ Γ : J v p 2 ≤ 1 E δKρ Γ : HJ v p 2 + ( -)/2 r=1 (i1,...,i2r)∈Cr( , ) g r,p = Γ :i2r+1 HJ Γ i2r:i2r-1 GJ Γ i2r-1-1:i2r-2+1 HJ ...Γ i4:i3 GJ Γ i3-1:i2+1 HJ Γ i2:i1 GJ Γ i1-1: HJ v p 2 = 1 E δKρ Γ :i2r+1 HJ P J i 2r ρ i2r (x) α i2r-1 (x) 2 2 . =ãi 2r * r-1 m=1 1 E δKρ i2m+2-1 k=i2m+1+1 s k α i2m+1 (x) * Γ i2m+1-1:i2m+1 HJ P J i 2m ρ i2m (x) α i2m+1 (x) 2 α i2m-1 (x) 2 . = bi 2m+2 ,i 2m+1 ,i 2m * 1 E δKρ i2-1 k=i1+1 s k α i1 (x) * Γ i1-1: HJ v p α i1 (x) 2 . =ci 2 ,i 1 =ã i2r r-1 m=1 bi2m+2,i2m+1,i2m cp i2,i1 (D.86) The magnitudes of the factors in this expression are bounded in the following lemma: Lemma D.15. For ãk , bqij , cp ts defined in (D.86), R u = d n , R f = 1 and ≤ k < , + 2 ≤ j + 2 ≤ i ≤ q ≤ , < s ≤ t ≤ ã ≤ C a.s., P [ã k > K s ] ≤ C e -c n L , P bqij > Ks √ L ≤ C e -c n L , P [|c p t | > CR p ] ≤ 2e -cd + e -c n , P |c p ts | > d n ≤ C e -c n L + 2e -c d for some constants c, c , C, C and d ≥ 0. Proof: Deferred to D.3.4. We will use these results in order to bound 1 E δKρ Γ : J v p 2 using (D.85). While the sum over most of these terms can be controlled using the triangle inequality and the lemma above, there is a subset which will require special treatment since they are typically larger. These are the terms where the leftmost or rightmost chain is a G-chain (meaning i 2r = or i 1 = respectively) and they will be controlled using martingale concentration. The "or" above is exclusive, since we can bound terms with i 2r = and i 1 = using a triangle inequality. We denote these three sets of terms by ← -G r,p , -→ G r,p , ← → G r,p respectively, and elements in them by ←g r,p , -→ g r,p , ← → g r,p for clarity when needed. Arranging the remaining terms into sets denoted G r,p , the sum in (D.85) decomposes into ( -)/2 r=1 (i1,...,i2r)∈Cr( , ) g r (i1,i2,...,i2r) = ( -)/2 r=1 Qr,p∈ Gr,p, ← - G r,p , - → G r,p , ← → G r,p g r,p ∈Qr,p g r,p .
(D.87) We consider first terms in ← -G r,p (and hence with i 2r = ). We denote such terms by ← -g r,p (i1,i2,...,i2r-1, ) = ã b ,i2r-1,i2r-2 r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 . We show Lemma D.16. For p ∈ {f, u} and R u = d n , R f = 1 i . P    ( -)/2 r=1 ← -g r,p ∈ ← - G r,p ← -g r,p > dL n    ≤ Ce -cd + C e -c n L (D.88) ii . P    ( -)/2 r=2 - → g r,p ∈ - → G r,p - → g r,p > CR p    ≤ Ce -cd + C e -c n L (D.89) for absolute constants c, C, C , and where d ≥ K log L for some constant K. Proof: Deferred to D.3.4. Turning next to bounding the terms in ← → G r,p , we first define an event ← → E = {|ã | ≤ C} ∩ +1≤i1≤i2-2≤i3-2≤ -2 bi3i2i1 ≤ K s √ L ∩ <i1≤ cp i1 ≤ CR p and from lemma D.15 and a union bound obtain P ← → E c ≤ L 3 C e -c n L + L 2e -cd + e -c n ≤ C e -c n L + 2e -c d assuming n ≥ KL log L, d ≥ K log L for some K, K . It follows that 1← → E ← → g r,p ∈ ← → G r,p ← → g r,p = 1← → E ( -)/2 r=1 (i2,...,i2r-1)∈Cr-1( , ) ã b ,i2r-1,i2r-2 r-2 m=1 bi2m+2,i2m+1,i2m ci2, ≤ ( -)/2 r=1 (i2,...,i2r-1)∈Cr-1( , ) 1← → E ã b ,i2r-1,i2r-2 r-2 m=1 bi2m+2,i2m+1,i2m ci2, ≤ ( -)/2 r=1 L 2r-2 K s √ L r-1 R p = ( -)/2 r=1 K s L 3/2 r-1 R p ≤ 1 -K s L 3/2 L/2 1 -K s L 3/2 R p ≤ 2R p . where we used L 3/2 K s ≤ 1 2 . We also bound the number of summands in (i2,...,i2r-1)∈Cr-1( , ) by L 2r-2 which is tight for small r. It follows that P    ← → g r,p ∈ ← → G r,p ← → g r,p > 2R p    ≤ P    1← → E ← → g r,p ∈ ← → G r,p ← → g r,p > 2R p    + P    1 -1← → E ← → g r,p ∈ ← → G r,p ← → g r,p > 0    = P ← → E c ≤ Ce -c n L + 2e -c d (D.90) for appropriate constants. It remains to bound the terms in G r,p by a similar argument. 
Defining E = ≤i1< {|ã i1 | ≤ K s } ∩ +2≤i1+2≤i2≤i3≤ bi3i2i1 ≤ K s √ L ∩ <i1≤i2≤ cp i2i1 ≤ d n truncating on this event gives 1 E g r,p ∈Gr,p g r,p ≤ 1 E g r,p ∈Gr,p g r,p ≤ L 3/2 K s r dL n ≤ 2 dL n and bounding the probability of this event from below using lemma D.15 and a union bound gives P g r,p ∈Gr,p g r,p > 2 dL n ≤ P 1 E g r,p ∈Gr,p g r,p > 2 dL n + P E c ≤ Ce -c n L + 2e -c d . Combining the bound above with (D.90) and the results of lemma D.16 and worsening constants, the sum of all terms containing matrices G i is bounded by P 1 E δKρ Γ : J v f 2 > C ≤ C e -c n L . We then apply lemma D.20 to obtain P 1 E δKρ Γ : J > C √ L ≤ C e -c n L . Recalling (D.78), to obtain our final bound on the operator norm it remains to control the probability of E ρ . We consider some ≤ i ≤ and assume α i-1 (x) ≠ 0 (otherwise ρ i (x) 2 ≤ C with probability 1). Defining an orthogonal matrix R such that Rα i-1 (x) = α i-1 (x) 2 e 1 , rotational invariance of the Gaussian distribution gives ρ i (x) 2 2 = α i-1 (x) * W i * W i α i-1 (x) d = α i-1 (x) 2 2 W i (:,1) 2 2 , E W i ρ i (x) 2 2 = 2 α i-1 (x) 2 2 . Since W i (:,1) 2 2 is a sum of independent sub-exponential random variables with sub-exponential norm bounded by C n , Bernstein's inequality (lemma G.2) and D.2 give P ρ i (x) 2 2 > C ≤ P ρ i (x) 2 2 -α i-1 (x) 2 2 > C 2 α i-1 (x) 2 2 ≤ C 2 + P α i-1 (x) 2 2 > C 2 ≤ 2e -cn + C e -c n L ≤ C e -c n L for appropriate constants. Taking a union bound over i gives P [E ρ ] ≥ 1 -C Le -c n L ≥ 1 -C e -c n L for a suitably chosen constant c , where we used n ≥ KL log L for some K. We then have P 1 E δK Γ : J v f 2 > C ≤ P 1 E δK ∩Eρ Γ : J v f 2 > C + P 1 E δK ∩E c ρ Γ : J v f 2 > 0 ≤ P 1 E δK ∩Eρ Γ : J v f 2 > C + P E c ρ ≤ Ce -c n L + C e -c n L ≤ C e -c n L for appropriate constants, and similarly P 1 E δK Γ : J > C √ L ≤ C e -c n L . This concludes the proof of the first two statements. For the final result, we consider a vector g with g i ∼ iid N (0, 1).
Bernstein's inequality gives P g P Γ L: J (x) -Γ L: HJ (x) g 2 > C √ dL ≤ P Γ L: J (x) -Γ L: HJ (x) v u 2 > C 2 dL n + P g 2 > 2 √ n ≤ e -cn + C e -c d + C e -c n L ≤ C e -c d . for appropriate constants, where we assumed n > KLd for some K. Corollary D.17. Defining Γ : (x) = P I W P I -1 . . . P I W , under the same assumptions on n, L in D.14 we have P Γ : (x)v 2 ≤ C ≥ 1 -e -c n L P Γ : (x) ≤ C √ L ≥ 1 -e -c n L for some numerical constants c, C. Lemma D.18. Fix a collection of supports J = {J . . . J } for 1 ≤ ≤ ≤ L that satisfy the assumptions of lemma D.14 and denote by v p a unit norm vector. Define an event E H = 1 E δ Γ : HJ v p 2 ≤ C ∩ 1 E δ Γ : HJ ≤ C √ L ∩ 1 E δ Γ : HJ 2 F ≤ Cn . If n ≥ KL log n for some constant K then P E H ≥ 1 -C e -c n L where c, C, C are absolute constants. Proof. In the following, we denote by W i an independent copy of W i , and by W i (:,j) the j-th column of this matrix. Note that due to rotational invariance of the Gaussian distribution we can replace every occurrence of H i in an expression by W i P S i-1⊥ without altering the distribution of the expression, which we will do presently. We can repeatedly use this rotational invariance to give where in the last inequality we used the fact that multiplication by P S i⊥ cannot increase the norm of a vector. Denoting by {χ i } a collection of independent standard chi-squared distributed random variables where χ i has |J i | degrees of freedom, we have v * p Γ : * HJ Γ : HJ v p = v * p H * P J Γ : +1 * HJ Γ : +1 HJ P J H v p d = v * p P S - i= P Ji W i (:,1) 2 2 d = i= 2 n χ i . Define E I = min ≤i≤ |I i (x)| ≥ n 4 ∩ i= 2 |I i (x)| n ≤ 2 . Denoting δ i = |J i I i (x)| , on E I we have min ≤i≤ |J i | ≥ n 4 - n L ≥ n 8 . (D.92) i= 2 |J i | n ≤ i= 2 |I i | + δ i n ≤ i= 2 |I i | + n L n = i= 2 |I i | n 1 + n L |I i | ≤ i= 2 |I i | n 1 + 4 L ≤ e 4 i= 2 |I i | n ≤ 2e 4 . where we used the assumption δ i ≤ n L and assumed n ≥ 8L. 
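The distributional identity above reduces the H-chain quadratic form to a product of independent normalized chi-squared factors, each with mean 2|J i |/n ≈ 1 when roughly half the coordinates are active; for n large relative to L the product concentrates near a constant. A small simulation with hypothetical sizes (n = 200, L = 10, |J i | = n/2):

```python
import numpy as np

rng = np.random.default_rng(3)
n, L, trials = 200, 10, 5000

# Each factor (2/n) * chi^2_{n/2} has mean exactly 2 * (n/2) / n = 1, so the
# product of L independent factors stays near 1 when n is large relative to L.
chi = rng.chisquare(df=n // 2, size=(trials, L))
prod = np.prod((2.0 / n) * chi, axis=1)

assert abs(prod.mean() - 1.0) < 0.1
```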
It follows that P i= 2 n χ i - i= 2 |J i | n > 1 E δ ∩ E I =P i= χ i |J i | -1 > i= n 2 |J i | E δ ∩ E I ≤P i= χ i |J i | -1 > 1 2e 4 E δ ∩ E I . An application of lemma D.26 and (D.92) then gives P i= χ i |J i | -1 > 1 2e 4 E δ ∩ E I ≤ CLe -c n L ≤ Ce -c n L (D.93) for appropriate constants, assuming n ≥ KL log L for some K. Using D.30 to bound P [E c I ] we thus have P 1 E δ v * p Γ : * HJ Γ : HJ v p > 1 + 2e 4 ≤ P 1 E δ i= 2 n χ i > 1 + i= 2 |J i | n ≤ P i= 2 n χ i > 1 + i= 2 |J i | n E δ ≤ P i= 2 n χ i > 1 + i= 2 |J i | n E δ ∩ E I + P [E c I ] ≤ Ce -c n L + C e -c n L ≤ C e -c n L for some constants. Having shown P 1 E δ v * p Γ : * HJ Γ : HJ v p > C ≤ C e -c n L for some fixed v p = v f we can now apply lemma D.20 to obtain P 1 E δ Γ : HJ > C √ L ≤ C Le -c n L ≤ C e -c n L . where we used n ≥ KL log L for some K. Choosing v p = e i for i ∈ [n] and taking a union bound, one obtains P 1 E δ Γ : HJ 2 F > Cn = P 1 E δ tr Γ : * HJ Γ : HJ > Cn = P n i=1 1 E δ e * i Γ : * HJ Γ : HJ e i > Cn ≤ n i=1 P 1 E δ e * i Γ : * HJ Γ : HJ e i > C ≤ nC e -c n L ≤ C e -c n L for some constants, where we used n ≥ KL log n for an appropriate constant K. A final union bound over the last three events gives the desired result. Proof of lemma D.15. We first consider the terms ãk . For k = , the definition of E δKρ in (D.78) gives ã = 1 E δKρ Γ : +1 HJ P J ρ (x) 2 α -1 (x) 2 = 1 E δKρ P J ρ (x) 2 α -1 (x) 2 ≤ C a.s. (D.94) In order to handle the case ≤ k < , we use (D.81) and obtain that for any 2 ≤ j ≤ i ≤ L, Γ i:j HJ P Jj-1 ρ j-1 (x) α j-2 (x) 2 = Γ i:j+1 HJ P Jj H j P Jj-1 ρ j-1 (x) α j-2 (x) 2 = Γ i:j+1 HJ P Jj H j Q j-1 (x) ρ j-1 (x) α j-2 (x) 2 d = Γ i:j+1 HJ P Jj W j P S j-1⊥ Q j-1 (x)ρ j-1 (x) d = Γ i:j+1 HJ P Jj W j (:,1) P S j-1⊥ Q j-1 (x)ρ j-1 (x) 2 α j-2 (x) 2 where W j is an independent copy of W j , and we denote by W j (:,1) the first column of W j , and we used the rotational invariance of the Gaussian distribution. 
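The quantity just obtained has the form ||B g||_2 with B a fixed (truncated) H-chain and g an independent Gaussian column; for g ~ N(0, I) the quadratic form ||B g||_2^2 has mean ||B||_F^2 and, by the Hanson-Wright inequality, concentrates around it. A generic sketch of that fact (arbitrary B, standard normal g; the paper's columns carry an extra 2/n variance scaling):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50
B = rng.standard_normal((m, m)) / np.sqrt(m)   # a fixed matrix
fro2 = np.linalg.norm(B, 'fro') ** 2           # E ||B g||_2^2 for g ~ N(0, I)

g = rng.standard_normal((20000, m))
vals = np.sum((g @ B.T) ** 2, axis=1)          # ||B g||_2^2 per sample

# The empirical mean matches ||B||_F^2; Hanson-Wright quantifies the tails.
assert abs(vals.mean() - fro2) / fro2 < 0.1
```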
Truncating on the event E i,j+1 H , which does not affect the distribution of W j (:,1) , we have E W j (:,1) 1 E i,j+1 H Γ i:j+1 HJ P Jj W j (:,1) 2 2 = 2 n 1 E i,j+1 H tr P Jj Γ i:j+1 * HJ Γ i:j+1 HJ ≤ 2 n 1 E i,j+1 H tr Γ i:j+1 * HJ Γ i:j+1 HJ = 2 n 1 E i,j+1 H Γ i:j+1 HJ 2 F ≤ C almost surely for some constant C , and the Hanson-Wright inequality (lemma G.4) gives P 1 E i,j+1 H Γ i:j+1 HJ P Jj W j (:,1) 2 2 > 1 + C ≤ 2 exp   -c min      n 2 4 1 E i,j+1 H Γ i:j+1 * HJ Γ i:j+1 HJ 2 F , n 2 1 E i,j+1 H Γ i:j+1 * HJ Γ i:j+1 HJ         ≤ 2 exp -c n (i -j + 1) for some constant c , where we used 1 E i,j+1 H Γ i:j+1 * HJ Γ i:j+1 HJ 2 F ≤ 1 E i,j+1 H Γ i:j+1 HJ 2 F Γ i:j+1 HJ 2 ≤ Cn(i -j + 1), and the fact that multiplying a matrix by P Jj cannot increase its norm. Writing for concision in the subsequent expression A = Γ i:j+1 HJ P Jj W j (:,1) 2 -1 E i,j+1 H Γ i:j+1 HJ P Jj W j (:,1) 2 it follows that P Γ i:j+1 HJ P Jj W j (:,1) 2 > C ≤ P 1 E i,j+1 H Γ i:j+1 HJ P Jj W j (:,1) 2 > C + P [A > 0] ≤ 2 exp -c n L + C exp -c n L ≤ C exp -c n L for appropriate constants, where we used D.18. Since on E δKρ we have P S j⊥ Q j (x) ρ j-1 (x) α j-2 (x) 2 2 ≤ Q j (x) ρ j-1 (x) α j-2 (x) 2 2 ≤ 2K s , we obtain P 1 E δKρ Γ i:j HJ P Jj-1 ρ j-1 (x) α j-2 (x) 2 2 > 2C K s ≤ C exp -c n L (D.95) hence P [ã k > K s ] ≤ C exp -c n L (D.96) for appropriate constants. We now turn to controlling the bqij . Note that 1 E δKρ q k=i s k = 1 E δKρ α q (x) 2 α i-1 (x) 2 q k=i 1 + α k (x) * α k (x) 2 2 Q k (x)W k α k-1 (x) ≤ 3 (1 + 2K J ) q-i ≤ 3e 2K J L ≤ 9 a.s. (D.97) where in the last inequality we used 2K J L < 1. Additionally, we have 1 E δKρ α i (x) * α i (x) * 2 Γ i-1:j+1 HJ P Jj ρ j (x) α j (x) 2 d = 1 E δKρ α i (x) * P Ji-1 α i (x) * 2 W i-1 P S i-2⊥ Γ i-2:j+1 HJ P Jj ρ j (x) α j (x) 2 . 
= 1 E δKρ α i (x) * P Ji-1 α i (x) * 2 W i-1 u d = σg where W i-1 is a copy of W i-1 that is independent of all the other variables, we defined u = P S i-2⊥ Γ i-2:j+1 HJ P Jj ρ j (x) α j (x) 2 and g is a standard normal variable. In the above expression, σ 2 = 1 E δKρ 2 n α i (x) * P Ji-1 / α i (x) * 2 2 2 u 2 2 ≤ 1 E δKρ 2 u 2 2 n . Note also that Γ i-2:j+1 HJ is well-defined since i ≥ j -1. We therefore have P 1 E δKρ α i (x) * α i (x) * 2 Γ i-1:j+1 HJ P Jj ρ j (x) α j (x) 2 > CK s √ L ≤ P 1 E δKρ u 2 > K s + P g 2 n K s > K s √ L ≤ C e -c n L + 2e -c n L ≤ C e -c n L (D.98) where we used (D.95) and the Gaussian tail probability to bound the first and second terms in the second line respectively. Combining the above with (D.97) we obtain P bqij > K s √ L ≤ C e -c n L . (D.99) We now turn to controlling cp ji . If i ≥ we have cp j,i+1 = 1 E δKρ j-1 k=i+2 s k α i+1 (x) * α i+1 (x) * 2 Γ i: HJ v p d = 1 E δKρ j-1 k=i+2 s k α i+1 (x) * P Ji α i+1 (x) * 2 W i P S i-1⊥ Γ i-1: HJ v p d = σg (D.100) where g is a standard normal and σ 2 = 1 E δKρ 2 n j-1 k=i+2 s k 2 P S i-1⊥ Γ i-1: HJ v p 2 2 ≤ 1 E δKρ 2C n Γ i-1: HJ v p 2 2 a.s. for some constant C where we used (D.97). We also have Γ i-1: HJ v p 2 2 ≤ C on E i-1, H . Lemma D.18 and a Gaussian tail bound then give P cp j,i+1 > d n ≤ P 1 E δKρ ∩E i-1, H j-1 k=i+2 s k α i+1 (x) * Γ i: HJ vp α i+1 (x) * 2 > d n +P 1 E δKρ -1 E δKρ ∩E i-1, H j-1 k=i+2 s k α i+1 (x) * Γ i: HJ vp α i+1 (x) * 2 > 0 ≤ P 2 n Cg > d n + P E i-1, H c ≤ 2e -cd + Ce -c n L (D.101) for appropriate constants. Additionally, if i = -1 we have from (D.97) for some fixed v p = v f cf j, = 1 E δKρ j-1 k= +1 s k α (x) * α (x) 2 Γ -1: HJ v f = 1 E δKρ j-1 k= +1 s k α (x) * α (x) 2 v f = 1 E δKρ j-1 k= +1 s k ≤ 9 a.s. (D.102) If instead v p = v u is drawn from Unif(S n-1 ), then denoting by g a vector with independent standard Gaussian entries we have 1 E δKρ α (x) * α (x) 2 v u d = e * 1 v u d = g 1 g 2 .
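The identity just used, that the inner product of a fixed unit vector with v_u ~ Unif(S^{n-1}) has the law of g_1 / ||g||_2, is the source of the gap between the rates R_u = sqrt(d/n) for a random direction and R_f = 1 for a fixed one: a random unit vector has inner product of typical size 1/sqrt(n) with any fixed direction. A quick simulation (illustrative n):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 100, 5000

# <a, v_u> for v_u ~ Unif(S^{n-1}) has the law of g_1 / ||g||_2, g ~ N(0, I_n),
# so its typical magnitude is O(1/sqrt(n)); E|<a, v_u>| ~ sqrt(2 / (pi n)).
g = rng.standard_normal((trials, n))
inner = np.abs(g[:, 0]) / np.linalg.norm(g, axis=1)

assert 0.03 < inner.mean() < 0.15   # sqrt(2 / (pi * 100)) is about 0.08
```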
From Bernstein's inequality it follows that P g 2 2 < n 2 ≤ e -cn . Combining this with a Gaussian tail bound gives P g 1 g 2 > d n ≤ P g 2 2 < n 2 + P |g 1 | > d 2 ≤ e -cn + 2e -c d for some constants c, c . From (D.97) it follows that P cu j, > d n = P 1 E δKρ j-1 k= +1 s k α (x) * α (x) 2 v u > d n ≤ e -cn + 2e -c d for appropriate constants. Proof of lemma D.16. Part (i). We denote the set of all such terms with r G-chains by ← -G r,p . Considering first the contribution from the terms with a single G-chain, denoted ← -G 1p , we have ← -g 1,p ∈ ← - G 1p ← -g 1,p = ã j= +1 cp ,j where j= +1 cp ,j = j= +1 1 E δKρ -1 k=j+1 s k α j (x) * Γ j-1: HJ v p α j (x) 2 . Denoting by σ(A 1 , . . . , A k ) the sigma-algebra generated by the random variables A 1 , . . . , A k , we define a filtration F -1 = σ v p , ρ 1 (x), . . . , ρ L (x) , F j = σ v p , ρ 1 (x), . . . , ρ L (x), H , . . . , H j , j = , . . . , . (D.103) The sequence {X i } = 1+i j= +1 c ,j is adapted to the filtration, and since the summands are linear in the zero-mean H k the sequence is a martingale (E X i+1 | F i = X i ). The martingale difference sequence is ∆ i = X i -X i-1 = cp ,i+1 = 1 E δKρ -1 k=i+2 s k α i+1 (x) * Γ i: HJ v p α i+1 (x) 2 giving j= +1 cp ,j = X -1 = -1 j= +1 ∆ i + X = -1 j= +1 ∆ i + cp , +1 . We cannot control this sum directly because we do not have almost sure control of the martingale differences. To remedy this, we recall the event E i-1, +1 H defined in lemma D.18, and decompose the sum of interest into -1 j= +1 ∆ i ≤ -1 j= +1 ∆ i -1 E i-1, +1 H ∆ i + -1 j= +1 1 E i-1, +1 H ∆ i . (D.104) Notice that the second sum is also a sum of zero-mean martingale differences. Using (D.100), we have 1 E i-1, +1 H ∆ i d = 1 E i-1, +1 H σg where g ∼ N (0, 1) and 1 E i-1, +1 H σ 2 = 1 E i-1, +1 H 2 n -1 k=i+1 s k 2 Γ i-1: HJ v p 2 2 ≤ C n almost surely for some constant C. It follows that E exp λ1 E i-1, +1 H ∆ i F i-1 ≤ exp cnλ 2 ∀λ, a.s.
and we can apply Freedman's inequality for martingales with sub-Gaussian increments (lemma G.7) to conclude that for some d ≥ 0 P   -1 j= 1 E i-1, +1 H ∆ i > √ d   ≤ 2 exp -c dn L . As for the first term in (D.104), using lemma D.18 we have P   -1 j= ∆ i -1 E i-1, +1 H ∆ i > 0   ≤ - i=1 P E i-1, +1 H c ≤ LC e -c n L ≤ C e -c n L for appropriate constants, where we assumed n ≥ KL log L for some K. Combining the above with (D.94), and using (D.101) to give P ã cp , +1 > C ≤ C e -c n L for some constants and applying the triangle inequality, we have P    ← -g 1,p ∈ ← - G 1p ← -g 1,p > C √ d    ≤ P   ã cp , +1 + |ã | -1 j= +1 ∆ i > C √ d   ≤ C e -c n L + C e -c dn L (D.105) for appropriate constants. Having controlled the sum of terms in ← -G 1p , we next consider a sum over the terms in ← -G r,p for r ≥ 2. The argument will be very similar to the ← -G 1p case, with some additional technical details. Note that since different G-chains must be separated by an H i matrix for some i and we consider only terms with i 1 ≥ + 1, the minimal starting index of the r-th chain (indexed by j below for clarity) is + 1 + 2(r -1). The sum of all possible terms is thus ← -g r,p ∈ ← - G r,p ← -g r,p =ã + 2r -1 ≤ j ≤ , (i 1 , . . . , i 2r-2 ) ∈ C r-1 ( + 1, j -2) b ,j,i2r-2 r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 . = j= +2r-1 pr ,j The constraints on the indices i 1 , . . . , i 2r-2 are similar to those in (D.84), with the starting index reflecting the constraint i 1 > in the definition of ← -G r,p . We once again define the filtration F j as in (D.103) for -1 ≤ j ≤ . Noting that ã , bk,l,m = 1 E δKρ k-1 q=l+1 s q α l (x) * Γ l-1:m+1 HJ P Jm ρ m (x) α l (x) 2 α m-1 (x) 2 and cp k,l = 1 E δKρ k-1 q=l+1 sqα l (x) * Γ l-1: HJ vp α l (x) 2 are all F l-1 -measurable, the index constraints imply that X r i = i+1 j= +2r-1 pr ,j is F i -measurable and thus the sequence {X r i } is adapted to the filtration. 
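The pattern used throughout this part is: truncate each martingale difference on a high-probability event so that it becomes conditionally sub-Gaussian, then apply Freedman's inequality (lemma G.7) to the truncated sum. A minimal simulation of the resulting tail bound, for a martingale whose increments are conditionally Gaussian with conditional standard deviation bounded by 1 (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
steps, trials = 50, 20000
sigma_max = 1.0

# Martingale with conditionally Gaussian increments: sigma_i is predictable
# (a function of the past) and bounded, so Freedman/Azuma gives
# P(|S_n| > t) <= 2 exp(-t^2 / (2 n sigma_max^2)).
S = np.zeros(trials)
for _ in range(steps):
    sigma = np.minimum(sigma_max, 0.3 + np.abs(np.tanh(S)))  # predictable, <= 1
    S += sigma * rng.standard_normal(trials)

t = 3.0 * np.sqrt(steps) * sigma_max
bound = 2.0 * np.exp(-t ** 2 / (2 * steps * sigma_max ** 2))
assert (np.abs(S) > t).mean() <= bound + 0.01
```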
b ,i+1,i2r-2 is a linear function of the zero-mean variables H i for any choice of i 2r-2 , and we can replace H i with W i P S i-1⊥ where W i is an independent copy of W i without altering the distribution of X r i . Since bklm for k ≤ i 2r-2 is independent of the W i for any choice of l, m, it follows that pr ,i+1 is also a linear function of the variables in W i which have zero mean. Consequently E X r i | F i-1 = i j= +2r-1 pr ,j = X r i-1 , hence {X i } is a martingale sequence. Defining martingale differences ∆ r i = X r i -X r i-1 = pr ,i+1 we have j= +2r-1 pr ,j = -1 i= +2r-2 ∆ r i . (D.106) We define an event E i ∆ ∈ F i by E i ∆ = {|ã | ≤ C} ∩ i1+2≤i2≤i3≤i bi3i2i1 ≤ Ks √ L ∩ i1≤i2≤i cp i2i1 ≤ C d n ∩ i1≤i 1 E δKρ Γ i:i1 HJ P J i 1 -1 ρ i 1 -1 (x) α i 1 -2 (x) 2 2 ≤ CK s (D.107) for i 1 ≥ + 1 and decompose the sum in (D.106) into -1 i= +2r-2 ∆ r i ≤ -1 i= +2r-2 ∆ r i -1 E i-1 ∆ ∆ r i + -1 i= +2r-2 1 E i-1 ∆ ∆ r i . (D.108) In order to control the second term, we note that 1 E i-1 ∆ ∆ r i = 1 E i-1 ∆ pr ,i+1 =1 E i-1 ∆ ã (i1,...,i2r-2)∈Cr-1( +1,i-1) b ,i+1,i2r-2 r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 =1 E i-1 ∆ ã (i1,...,i2r-2)∈Cr-1( +1,i-1) 1 E δKρ -1 q=i+2 s q α i+1 (x) * Γ i:i 2r-2 +1 HJ P J i 2r-2 ρ i 2r-2 (x) α i+1 (x) 2 α i 2r-2 -1 (x) 2 * r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 and using Γ i:i2r-2+1 HJ = P Ji H i Γ i-1:i2r-2+1 HJ d = P J (i) W i P S i-1⊥ Γ i-1:i2r-2+1 HJ where W i is an independent copy of W i gives 1 E i-1 ∆ ∆ r i d = σg where g ∼ N (0, 1) and σ = 2 n 1 E δKρ ∩E i-1 ∆ -1 q=i+2 s q ã (i1,...,i2r-2) ∈Cr-1( +1,i-1) P S i-1⊥ Γ i-1:i 2r-2 +1 HJ P J i 2r-2 ρ i 2r-2 (x) α i 2r-2 -1 (x) 2 * r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 2 ≤ 2 n 1 E δKρ ∩E i-1 ∆ -1 q=i+2 s q ã (i1,...,i2r-2) ∈Cr-1( +1,i-1) Γ i-1:i 2r-2 +1 HJ P J i 2r-2 ρ i 2r-2 (x) 2 α i 2r-2 -1 (x) 2 * r-2 m=1 bi2m+2,i2m+1,i2m cp i2,i1 ≤ 2 n 1 E δKρ ∩E i-1 ∆ -1 q=i+2 s q ã L 2r-2 max i1≤i-1 1 E δKρ ∩E i-1 ∆ Γ i-1:i 1 HJ P J i 1 -1 ρ i 1 -1 (x) 2 α i 1 -2 (x) 2 * max i1+2≤i2≤i3≤i-1 1 E δKρ ∩E i-1 ∆ 
bi3i2i1 r-2 max i1≤i2≤i-1 1 E δKρ ∩E i-1 ∆ cp i2i1 ≤ a.s. C √ dL n L 2r-2 K s √ L r-1 = C √ dL n L 3/2 K s r-1 . In the last inequality we used the definition of E i ∆ and the assumption L 3/2 K s ≤ 1. It follows that E exp λ1 E i-1 ∆ ∆ i F i-1 ≤ exp cn 2 λ 2 dL L 3/2 K s 2r-2 ∀λ, a.s. and we can apply Freedman's inequality for martingales with sub-Gaussian increments (lemma G.7) to conclude P -1 i= +2r-2 1 E i-1 ∆ ∆ i > L 3/2 K s r-1 dL n ≤ 2 exp -c n L . (D.109) It remains to bound the first term in (D.108). Using lemma D.15, (D.95) and taking a union bound over i 1 , i 2 , i 3 in (D.107) we have P E i ∆ ≥ 1 -CL 3 e -c n L -C L 2 e -c n L + e -c d ≥ 1 -C e -c n L -C e -c d where we assume n ≥ KL log L and d ≥ K log L for some K, K . An additional union bound over i gives P -1 i= +2r-2 ∆ r i -1 E i-1 ∆ ∆ r i > 0 ≤ - i=1 P E i-1 ∆ c ≤ LC e -c n L -e -c d ≤ C e -c n L -e -c d for appropriate constants. Combining the above with (D.109) and recalling (D.106) gives P j= +2r-1 pr ,j > L 3/2 K s r-1 dL n ≤ Ce -c n L + C e -c d for some constants. The bound L 3/2 K s ≤ 1 2 implies ( -)/2 r=2 L 3/2 K s r-1 = L 3/2 K s ( -)/2 -2 r=0 L 3/2 K s r = L 3/2 K s 1 -L 3/2 K s ( -)/2 -1 1 -L 3/2 K s ≤ 2. A final union bound over r and a rescaling of d gives P ( -)/2 r=2 ← -g r ∈ ← - G r ← -g r > dL n =P ( -)/2 r=2 j= +2r-1 pr ,j > dL n ≤CLe -c n L + C Le -c d ≤ C e -c n L + C e -c d (D.110) for appropriate constants, again assuming n ≥ KL log L and d ≥ K log L for some K, K . Combining the above with equation (D.105) and worsening constants gives P ( -)/2 r=1 ← -g r ∈ ← - G r ← -g r > dL n ≤ Ce -c n L + C e -c d for appropriate constants. Part (ii). We consider terms in the sets -→ G r,p (with i 1 = and i 2r ≤ -1). In contrast to the previous section, the bounds on these terms will differ based on the value of the p subscript (denoting whether we use a fixed vector v f or a random vector v u ).
We first consider - → G 1,p , noting ← -g 1,p ∈ ← - G 1p ← -g 1,p = -1 j= ãj cp j, . Lemma D.15 and a union bound give P   -1 j= ãj ≤ K s ∩ -1 j= cf j, ≤ C ∩ -1 j= cu j, ≤ C d 0 n   ≥ 1 -LC e -c n L -L 2e -c d0 -e -c n ≥ 1 -C e -c n L -2e -c d0 for appropriate constants, where we assume n ≥ KL log L and d 0 ≥ K log L for some constants K, K . With the same probability we have ← -g 1p ∈ ← - G 1p ← -g 1p ≤ ← -g 1p ∈ ← - G 1p ← -g 1p ≤ CLK s R 0 p ≤ CR 0 p (D.111) where we defined R 0 u = d 0 n , R 0 f = 1 and used LK J ≤ 1. We next consider sums of terms in -→ G r,p for r > 1. In controlling the sum of these terms, the proof will proceed along similar lines to the previous section. The main tool we will be utilizing is martingale concentration. Recall that since i 2r ≤ -1 and every two G-chains are separated by a matrix H i , the starting index of the final G-chain is no larger than -2r + 1. We thus have - → g r,p ∈ - → G r,p - → g r,p = ≤ j ≤ -2r + 1, (i 3 , . . . , i 2r ) ∈ C r-1 (j + 2, -1) ãi2r r-1 m=2 bi2m+2,i2m+1,i2m bi4,i3,j cp j, . = -2r+1 j= pr,p j . Define a filtration F 0 = σ v p , ρ 1 (x), . . . , ρ L (x) , F j = σ v p , ρ 1 (x), . . . , ρ L (x), H , . . . , H -j+1 , j ∈ [ -+ 1] (note the reversed indexing convention compared to the filtration defined in (D.103)). Since pr,p j is F -j -measurable (as can be seen from (D.86)), we can define X r,p i = j= -i pr,p j and it follows that X r,p i is F i -measurable. Recalling from (D.86) that bi4,i3,j = 1 E δKρ i4-1 k=i3+1 s k α i3 (x) * Γ i3-1:j+1 HJ P J j ρ j (x) α i3 (x) 2 α j-1 (x) 2 and hence pr,p j is linear in the zero-mean variables H j+1 , we have EX r,p i+1 |F i = E j= -i-1 pr j |F i = j= -i pr j + E H 1 ,...,H -i pr -i-1 = j= -i pr j = X i and thus the sequence {X r,p i } is a martingale with respect to this filtration. Defining martingale differences ∆ r,p i = X r,p i -X r,p i-1 = pr,p -i the sum of interest can be expressed as -2r+1 i= pr,p i = - i=2r-1 ∆ r,p i . 
(D.112) We now define an event which we will shortly show holds with high probability: E i-1 ∆ = (i3,...,i2r)∈Cr-1( -i+2, -1) {|ã i2r | ≤ K s } ∩ r-1 m=2 bi2m+2,i2m+1,i2m ≤ K s √ L ∩ cf -i, ≤ C ∩ cu -i, ≤ C d 1 n ∩            1 E δKρ i4-1 k=i3+1 s k α i3 (x) * Γ i3-1: -i+2 HJ P J -i+1 2 α i3 (x) 2 ≤ C √ L            ∩ 1 E δKρ P S -i⊥ Q J -i ρ -i (x) 2 α -i-1 (x) 2 ≤ 2K s . (D.113) Truncating the martingale difference on such an event gives 1 E i-1 ∆ ∆ r,p i = 1 E i-1 ∆ pr,p -i = 1 E i-1 ∆ (i3,...,i2r)∈Cr-1( -i+2, -1) ãi2r r-1 m=2 bi2m+2,i2m+1,i2m bi4,i3, -i cp -i, = 1 E i-1 ∆ × (i3,...,i2r) ãi2r r-1 m=2 bi2m+2,i2m+1,i2m 1 E δKρ i4-1 k=i3+1 s k α i3 (x) * Γ i3-1: -i+1 HJ P J -i ρ -i (x) α i3 (x) 2 α -i-1 (x) 2 cp -i, d = σ p g for a standard normal g, where we used Γ i3-1: -i+1 HJ P J -i ρ -i (x) α -i-1 (x) 2 = Γ i3-1: -i+2 HJ P J -i+1 H -i+1 P J -i ρ -i (x) α -i-1 (x) 2 = Γ i3-1: -i+2 HJ P J -i+1 H -i+1 Q J -i ρ -i (x) α -i-1 (x) 2 d = Γ i3-1: -i+2 HJ P J -i+1 W -i+1 P S -i⊥ Q J -i ρ -i (x) α -i-1 (x) 2 with W -i+1 an independent copy of W -i+1 and we have defined σ p = 2 n 1 E i-1 ∆ ∩E δKρ (i3,...,i2r)∈Cr-1( -i+2, -1) ãi2r r-1 m=2 bi2m+2,i2m+1,i2m cp -i, * i 4 -1 k=i 3 +1 s k α i 3 (x) * Γ i 3 -1: -i+2 HJ P J -i+1 2 α i 3 (x) 2 P S -i⊥ Q J -i ρ -i (x) 2 α -i-1 (x) 2 . Note that from (D.113), if we define R u = d 1 n , R f = 1, the standard deviation σ p can be bounded as σ p ≤ a.s. CK 2 s √ n K s √ L r-2 L 2r-3/2 R p ≤ CR p √ Ln L 3/2 K s r-1 where in the first inequality we used a triangle inequality, bounded the number of summands by L 2r-2 . In the second inequality we used L 3/2 K s ≤ 1 2 . Writing the sum in D.112 as - i=2r-1 ∆ r,p i ≤ - i=2r-1 1 -1 E i-1 ∆ ∆ r,p i ≤ - i=2r-1 1 E i-1 ∆ ∆ r,p i , and recognizing that the second sum is over a zero-mean adapted sequence that obeys E exp λ1 E i-1 ∆ ∆ i F i-1 ≤ exp cnλ 2 R 2 p L 3/2 K s 2r-2 ∀λ, a.s. 
an application of Freedman's inequality for martingales with sub-Gaussian increments (lemma G.7) gives P   - i=2r-1 1 E i-1 ∆ ∆ r,p i > R p L 3/2 K s r-1   ≤ 2 exp -t 2 2Lσ 2 p ≤ a.s. 2e -cn . Turning now to controlling the probability of E i-1 ∆ holding, we use lemmas D.15, D.18, the definition of K J and a union bound to conclude P E i-1 ∆ c ≤ L 3 Ce -c n L + L 2e -c d1 + e -c n + L 2 C e -c n L ≤ C e -c n L + e -c d1 for appropriate constants, where we assumed n ≥ KL log L, d 1 ≥ K log L for some K, K . Combining the previous two results gives P -2r+1 i= pr,p i > R p L 3/2 K s r-1 = P   - i=2r-1 ∆ r,p i > R p L 3/2 K s r-1   ≤ P   - i=2r-1 1 E i-1 ∆ ∆ r,p i > R p L 3/2 K s r-1   + P   - i=2r-1 1 -1 E i-1 ∆ ∆ r,p i > 0   ≤ 2e -cn + LP E i-1 ∆ c ≤ Ce -c n L + e -c d1 for some constants. Noting as before that  ( -)/2 r=2 L 3/2 K s r-1 ≤ 2, P    ( -)/2 r=2 - → g r,p ∈ - → G r,p - → g r,p > CR p    ≤ LC e -c n L + e -c d1 ≤ C e -c n L + e -c d1 for appropriate constants, where we used again n ≥ KL log L, d 1 ≥ K log L for some K, K . Lemma D.19. (Horn et al., 1994) Given a semidefinite matrix A, for any partitioning A =     A 11 A 12 . . . A 1b A 21 A 22 . . . . . . A b1 A bb     we have A ≤ b i=1 A ii . Lemma D.20. Given a semidefinite matrix A and unit norm v, if P [v * Av ≤ C ] ≥ 1 -C p exp -c 1 n and n > 2 log(9) c1 , then P [ A ≤ C ] ≥ 1 -C p+1 exp -c n for some constants c, c , C, C , C . Proof. We partition A into blocks of size c2n for an appropriately chosen c 2 . There are c2 such blocks, and we similarly partition the coordinates {1, . . . , n} into c2 sets K i = {1 + (i -1) c2n : i c2n } for i ∈ [ c2 ]. We proceed to bound the operator norm of the diagonal blocks using a standard ε-net argument (Vershynin, 2018) . The set of unit norm vectors supported on some K i forms a sphere S c 2 n . We can thus construct a 1 4 -net N i on this sphere with at most e log(9)c2 n points. 
A standard argument gives A ii ≤ C sup x∈Ni x * A ii x . We control the RHS by a taking a union bound over the net, finding P sup x∈Ni x * A ii x ≤ C = P sup x∈Ni x * Ax ≤ C ≥ 1 -|N i | C p exp -c 1 n ≥ 1 -C p exp (log(9)c 2 -c 1 ) n . We now choose c 2 to satisfy log(9)c 2 = c1 2 , and the blocks will still have non-zero size because we assume n > 2 log(9) c1 . Taking a union bound over the c2 blocks and using Lemma D.19 gives A ≤ c 2 i=1 A ii ≤ C w.p. P ≥ 1 -C p+1 exp -c n for some constants c, C, C . Lemma D.21. Assume n ≥ max {KL log n, K Ld b , K }, d b ≥ K log L for suitably chosen K, K , K , K . Define J as in Lemma D.14. For x ∈ S n0-1 and β J = W L+1 P J L W L . . . W +2 P J +1 * , denote d i = |I i (x) J i | , d = (d 1 , . . . , d L ) , and d min = min i d i . We then have P 1 E δK β J -β (x) 2 > C √ d b L + C s d 1 + 2 d 2 1 Ls n + Ld b n d 1 d 1/2 1/2 ≤ e -c max{dmin,1}s + e -c n L + e -c d b for absolute constants c, c , c , C, C , C , C , where the event E δK is defined in lemma D.14. Other useful forms of this result are P   1 E δK β J -β (x) 2 > C d b L + C Lns L 2 s + d 1 n + Ld b n d 1 2 1 2 + CL s, d   ≤ e -cs + e -c n L + e -c d b + L i= e -csi max{di,1} where s i ≥ 1.
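The block-partition bound of Lemma D.19, which drives the proof of Lemma D.20 above, is easy to sanity-check numerically. In the sketch below the random positive semidefinite matrix and the even four-way partition are our illustrative choices; the inequality itself (operator norm at most the sum of diagonal block norms) holds for every positive semidefinite matrix and every partition.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 120, 4                      # matrix size and number of diagonal blocks

G = rng.standard_normal((n, n))
A = G @ G.T                        # a random positive semidefinite matrix

blk = n // b
diag_norms = [
    np.linalg.norm(A[i * blk:(i + 1) * blk, i * blk:(i + 1) * blk], 2)
    for i in range(b)
]
op_norm = np.linalg.norm(A, 2)

# Lemma D.19: ||A|| <= sum_i ||A_ii|| for positive semidefinite A.
print(op_norm, sum(diag_norms))
```

This is exactly why Lemma D.20 only needs to control quadratic forms over vectors supported on each block of coordinates: an ε-net per block plus a union bound over the O(1/c2) blocks suffices.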

Proof.

Denoting by H i the weight matrices projected onto the subspace orthogonal to the features as in lemma D.14, we define β H (x) = H +1 * β H (x) = W L+1 P I L (x) H L . . . H +2 P I +1 (x) H +1 * β HJ = H +1 * β HJ = W L+1 P J L H L . . . H +2 P J +1 H +1 * for = 0, . . . , L-1. Note the additional matrix compared to the standard definition of the backward features. Control of the norm of the difference between them can then be used to control the backward features and Lipschitz constant of the network. Note also that H may not be a square matrix (and indeed in the case of the Lipschitz constant it will be rectangular). We denote the number of columns of H +1 by n -1 . Writing 1 E δK β J -β (x) 2 ≤ 1 E δK β HJ -β H (x) 2 + 1 E δK β J -β HJ (x) 2 + 1 E δK β (x) -β H (x) 2 , (D.114) we begin by bounding the first term. For Γ i:j H (x), Γ i:j HJ defined as in D.14 and Q i (x) = P Ji -P Ii(x) , (D.115) we have 1 E δK β HJ -β H (x) 2 2 =1 E δK W L+1 Γ L: +1 HJ -Γ L: +1 H (x) 2 2 =1 E δK W L+1 L i= +1 Γ L:i+1 H Q i (x)H i Γ i-1: +1 HJ 2 2 . = L i= +1 b i 2 2 . We first bound b i 2 2 . Repeated use of the rotational invariance of the Gaussian distribution in a similar manner to the proof of lemma D.18 gives b i 2 2 = 1 E δK W L+1 Γ L:i+1 H Q i (x)H i Γ i-1: +1 HJ 2 2 d = n 2 L k=i+1 H k+1 (1,:) 1 E δK P I k (x) 2 2 . =ξ I k (x) H i+1 (1,:) 1 E δK Q i (x) 2 2 i-1 k= +1 H k+1 (1,:) 1 E δK P J k 2 2 . =ξ J k H +1 (1,:) 2 2 where we defined H L+1 (1,:) = 2 n W L+1 . Denoting by W k an independent copy of W k , rotational invariance gives ξ J k ≤ W k+1 (1,:) 1 E δK P J k 2 2 d = 2 n χ k where χ k is a standard chi-squared distributed random variable with |suppdiag1 E δK P J k | degrees of freedom that is independent of all the other variables in the problem, and similarly for ξ I k (x) . A product of such terms was bounded in lemma D.18, from which we obtain P n 2 L k=i+1 ξ I k (x) i-1 k= +1 ξ J k > Cn ≤ e -c n L . (D.116) We similarly have H i+1 (1,:) 1 E δK Q i (x) 2 2 ≤ a.s. 
1 E δK Q i (x) W k+1 * (1,:) 2 2 . Recalling (D.115) and since d i = suppdiagQ i (x) we recognize that 1 E δK Q i (x) W k+1 * (1,:) 2 2 d = 1 E δK 2 n χ i where χ i is a standard chi-squared distributed random variable with d i degrees of freedom. If d i = 0 this variable is identically 0. Otherwise, d i ≥ 1 and Bernstein's inequality gives > C 1 n t ≤ e -ct for t > Kn -1 for some K. Combining these results with (D.116) and taking a union bound, we obtain P [χ i -d i > Csd i ] ≤ 2e -csdi ⇒ P [χ i > C sd i ] ≤ 2e -csdi ≤ 2e -cs max{di, P L i= +1 b i 2 2 > C 1 n t L i= +1 s i d i ≤ 2 L i= +1 e -csi max{di,1} + 2L(e -c t + C e -c n L ) ≤ 2 L i= +1 e -csi max{di,1} + e -c t + e -c n L (D.118) for appropriate constants, assuming t ≥ K log L, n ≥ K L log L for some K, K , which can be simplified to P L i= b i 2 2 > C 1 n ts L i= +1 d i ≤ 2 L i= e -cs max{di,1} + e -c n t + e -c n L ≤ 2Le -cs + e -c n t + e -c n L ≤ e -c s + e -c n t + e -c n L (D.119) assuming s ≥ K log L for some K . We next bound | b i , b j | for ≤ j < i ≤ L. Once again using rotational invariance starting from the last layer weights, we obtain b i , b j =1 E δK W L+1 Γ L:i+1 H (x)Q i (x)H i Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * Q j (x)Γ L:j+1 * H (x)W L+1 d = n 2 1 E δK L k=i+2 ξ I k (x) H i+1 (1,:) Q i (x)H i Γ i-1: +1 HJ Γ j-1: +1 * HJ . =Φ i-1:j-1 H j * Q j (x)Γ i:j+1 * H (x)H i+1 * (:,1) (where we interpret an empty product as unity). As before, we find using lemma D.18 that P 1 E δK n 2 L k=i+2 ξ I k (x) > Cn ≤ C e -c n L . 
(D.120) We proceed to bound the remaining factors in b i , b j , by first writing If i > j + 1, defining H j+1 = W j+1 P S j⊥ where W j+1 is an independent copy of W j+1 , with Γ i-1: +1 HJ denoting the matrix Γ i-1: +1 HJ with H j+1 b i , b j = 1 E δK n 2 L k=i+2 ξ I k (x) H i+1 (1,:) Q i (x)H i Φ i-1:j-1 H j * Q j (x)Γ i:j+1 * H (x)H i+1 * (:,1) d = 1 E δK n 2 L k=i+2 ξ I k (x) di ki=1 dj kj =1 H i+1 (1,ki) s ki H i (ki,:) Φ i-1:j-1 H j * (:,kj ) s kj H j+1 * (kj (1,:) in place of H j+1 (1,:) , and writing for concision Ξ ki,kj i,j = H i+1 (1,ki) H i (ki,:) Γ i-1:j+2 HJ P Jj+1(:,1) H j+1 (1,:) -H j+1 (1,:) Φ j:j-1 H j * (:,kj ) H j+1 * (kj ,1) u i+1:j+1 and Ψ ki,kj i,j = H i+1 (1,ki) H i (ki,:) Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) H j+1 * (kj ,1) u i+1:j+1 2 we have b i , b j d = 1 E δK n 2 L k=i+2 ξ I k (x) di ki=1 dj kj =1 Ξ ki,jj i,j +1 E δK n 2 L k=i+2 ξ I k (x) di ki=1 dj kj =1 Ψ ki,kj i,j . = n 2 A i,j 1 + A i,j 2 , (D.121) where we used the invariance of the Gaussian distribution to reflections around the mean, {H m } d = W m P S m-1⊥ , and the independence between the W m variables and the sign variables {s km } to absorb the latter into the former. Making a separate definition for concision 1 E δK di ki=1 H i+1 (1,ki) H i (ki,:) Γ i-1:j+2 HJ P Jj+1(:,1) . =B i,j 1 we first consider the term A i,j 1 d = B i,j 1              1 E δK dj kj =1 H j+1 (1,:) Φ j:j-1 H j * (:,kj ) H j+1 * (kj ,1) . =B i,j 2 -1 E δK dj kj =1 H j+1 (1,:) Φ j:j-1 H j * (:,kj ) H j+1 * (kj ,1) . =B i,j 3              1 E δK u i+1:j+1 2 . =B i,j 4 . Lemma D.22 gives P B i,j 4 > C = P 1 E δK u i+1:j+1 2 > C ≤ C e -c n L . (D.122) We next consider B i,j 3 . 
Writing B i,j 3 = 1 E δK dj kj =1 H j+1 (1,:) Φ j:j-1 H j * (:,kj ) H j+1 * (kj ,1) d = 1 E δK dj kj =1 H j+1 (1,:) Φ j:j-1 H j * (:,kj ) P S j ⊥ W j+1 * (kj ,1) First, since the variables W j+1 * (kj ,1) are independent of 1 E δK H j+1 (1,:) Φ j:j-1 H j * (:,kj ) (P J ) (kj ,kj ) , a Gaussian tail bound gives P    dj kj =1 H j+1 (1,:) Φ j:j-1 H j * (:,kj ) H j+1 * (kj ,1) > 2d n dj kj =1 1 E δK H j+1 (1,:) Φ j:j-1 H j * (:,kj ) (P J ) (kj ,kj ) 2    ≤ e -cd (D.123) for some constants and d ≥ K for some K. Two applications lemma D.22 give P   dj kj =1 1 E δK H j+1 (1,:) Φ j:j-1 H j * (:,kj ) (P J ) (kj ,kj ) 2 > Cd j 1 n t   ≤ dj kj =1 P 1 E δK H j+1 (1,:) Φ j:j-1 H j * (:,kj ) (P J ) (kj ,kj ) 2 > C 1 n t ≤ d j e -c n L + e -c t ≤ e -c n L + e -c t assuming t ≥ K log n, n ≥ K L log n for some K. Combining this bound with (D.123) we obtain P B i,j 3 > C dd j t n ≤ e -cd + e -c n L + e -c t (D.124) for appropriate constants. We now turn to bounding B i,j 2 . Define by Q j a matrix such that Q j ab = Q j ab (x) . Then B i,j 2 d = 1 E δK W j+1 (1,:) P S j⊥ Φ j:j-1 H j * Q j P S j⊥ W j+1 * (:,1) . In order to bound this term using the Hanson-Wright inequality, we first note that since P S j⊥ Φ j:j-1 H j * Q j P S j⊥ ≤ Φ j:j-1 H j * ≤ Γ j: +1 HJ Γ j-1: +1 * HJ H j * , P S j⊥ Φ j:j-1 H j * Q j P S j⊥ 2 F ≤ Φ j:j-1 H j * 2 Q j 2 F = Φ j:j-1 H j * 2 d j , we can use lemma D.14, a standard ε-net argument to control the operator norm of a Gaussian matrix and a union bound to obtain P    1 E δK P S j⊥ Φ j:j-1 H j * Q j P S j⊥ ≤ CL ∩ 1 E δK P S j⊥ Φ j:j-1 H j * Q j P S j⊥ 2 F ≤ CL 2 d j    ≥ 1 -e -c n L + e -c n ≥ 1 -e -c n L . We also have E W j+1 W j+1 (1,:) P S j⊥ Φ j:j-1 H j * Q j P S j⊥ W j+1 * (:,1) = 2 n tr P S j⊥ Φ j:j-1 H j * Q j P S j⊥ = 2 n dj kj =1 e * kj P S j⊥ Γ j: +1 HJ Γ j-1: +1 * HJ H j * e kj and using lemmas D.14 and D.22 gives P   dj kj =1 e * kj P S j⊥ Γ j: +1 HJ Γ j-1: +1 * HJ H j * e kj ≤ C d j n   ≥ 1 -d j e -c n L ≥ 1 -e -c n L . 
assuming n > KL log n for some K. Denoting the union of these two events by G, an application of the Hanson-Wright inequality (lemma (G.4)) gives P 1 G B i,j 2 > s 2 + 2Cd j n ≤ C exp -c min n 2 s 2 2 L 2 d j , ns 2 L (D.125) for appropriate constants and s 2 ≥ 0, and an additional union bound gives P B i,j 2 > s 2 + 2Cd j n ≤ C exp -c min n 2 s 2 2 L 2 d j , ns 2 L + e -n L . (D.126) We next turn to bounding B i,j 1 . Rotational invariance of the Gaussian distribution gives B i,j 1 = 1 E δK di ki=1 H i+1 (1,ki) H i (ki,:) Γ i-1:j+2 HJ P Jj+1(:,1) d = 1 E δK di ki=1 H i+1 (1,ki) W i (ki,1) P Ji-1 Γ i-1:j+2 HJ P Jj+1(:,1) 2 since P Ji-1 Γ i-1:j+2 HJ P Jj+1(:,1) and W i (ki,1) are independent.
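The Hanson–Wright step used above to control B i,j 2 asserts that a Gaussian quadratic form concentrates about its trace at a scale governed by the Frobenius and operator norms of the matrix, with a sub-Gaussian and a sub-exponential tail branch. A minimal sketch of this behavior follows; the matrix ensemble and the 4·‖A‖_F threshold are our illustrative choices, not the matrices appearing in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 400, 5000

# A fixed matrix with ||A||_F ~ sqrt(n) and ||A|| ~ 2 (Ginibre scaling).
A = rng.standard_normal((n, n)) / np.sqrt(n)
fro = np.linalg.norm(A, "fro")

# Hanson-Wright: for standard Gaussian x, |x^T A x - tr(A)| > t with
# probability at most exp(-c min(t^2 / ||A||_F^2, t / ||A||)).
X = rng.standard_normal((trials, n))
q = np.sum((X @ A) * X, axis=1)          # quadratic forms, one per trial
tail = float(np.mean(np.abs(q - np.trace(A)) > 4 * fro))
print(tail)
```

At t = 4‖A‖_F the sub-Gaussian branch dominates and the empirical tail is negligible, matching the e^{-c min(n^2 s^2/(L^2 d_j), n s/L)} form of (D.125) once the norm bounds on the truncated matrix are substituted.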

Since H i+1 (1,ki) , W i (ki,1) are both sub-Gaussian with sub-Gaussian norm bounded by C √ n , the product of two such variables is a sub-exponential variable with sub-exponential norm satisfying H i+1 (1,ki) W i (ki,1) ψ1 ≤ C n for some constants. Thus the first sum above is a sum of independent, zero-mean sub-exponential random variables, and Bernstein's inequality gives P di ki=1 H i+1 (1,ki) W i (ki,1) > s 1 ≤ 2e -c min{ n 2 s 2 1 d i ,ns1} (D.127) for s 1 ≥ 1 and some constant c. Since 1 E δK P Ji-1 Γ i-1:j+2 HJ P Jj+1(:,1) 2 ≤ 1 E δK Γ i-1:j+2 HJ e 1 2 we can apply lemma D.14 to obtain P 1 E δK P Ji-1 Γ i-1:j+2 HJ P Jj+1(:,1) 2 > C ≤ C e -c n L for appropriate constants. Combining the last two results gives P B i,j 1 > Cs 1 ≤ e -c min{ n 2 s 2 1 d i ,ns1} + e -c n L (D.128) for some constants. Combining the above with (D.122), (D.124) and (D.126), we have P A i,j 1 ≥ C s 2 + 2d j n + dd j t n s 1 ≤ e -c min{ n 2 s 2 1 d i ,ns1} + e -c n L + e -c d + e -c min n 2 s 2 2 L 2 d j , ns 2 L + e -c t (D.129) In the above proof we assumed i > j + 1. If instead i = j + 1 we simply set Γ i-1: +1 HJ = Γ i-1: +1 HJ in the expression for A i,j 2 in (D.121) and we have b j+1 , b j = A j+1,j 2

. We now turn to controlling the term A i,j 2 . Since i > j, H i (ki,:) d = W i (ki,:) P S i-1⊥ and P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) is independent of W i (ki,: ) , rotational invariance of the Gaussian distribution gives  A i,j 2 d = di ki=1 H i+1 (1,ki) W i (ki,1) . =C i 1 × 1 E δK   dj kj =1 P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 H j+1 * (kj ,1)   u i+1:j+1 2 . =C i, P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 , the second factor in (D.130) is also zero-mean, and it follows that P       1 E δK dj kj =1 (P S j⊥ ) (kj ,kj ) P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 W j+1 (1,kj ) > 2d n dj kj =1 1 E δK (P S j⊥ ) (kj ,kj ) P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 2       ≤ C e -cd for some constants and d ≥ 0. Since P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 2 ≤ Γ i-1: +1 HJ 2 Γ j-1: +1 * HJ H j * (:,kj ) 2 2 , applying lemmas D.22 and D.14 total of d j times and taking a union bound gives P   dj kj =1 1 E δK (P S j⊥ ) (kj ,kj ) P S i-1⊥ Γ i-1: +1 HJ Γ j-1: +1 * HJ H j * (:,kj ) 2 2 > C d j Lt n   ≤ d j e -c n L + e -c t ≤ e -c n L + e -c t where we assumed n ≥ KL log n, t ≥ K log n for some constants. Combining the above three results with (D.122) and taking a union bound, we obtain P A i,j 2 > Cs 1 d j dn -1 Lt n ≤ e -cd + e -c n L + e -c t + e -c t + e -c min{ n 2 s 2 1 d i ,ns1} for appropriate constants. Taking a union bound over this result as well as (D.129) and (D.120) allows us to bound the inner product by P | b i , b j | ≥ Cns 1 s 2 + d j n + d j dn -1 Lt n ≤ e -c min{ n 2 s 2 1 d i ,ns1} + e -c n L + e -cd + e -c min n 2 s 2 2 L 2 d j , ns 2 L + e -c t (D.131) for some constants, again assuming n ≥ KLd . At this point we obtain a bound on the sum of these inner products that will be useful in an application where the {d i } are expected to be small. Subsequently, we will derive a different expression that will be useful when they are large.

We now choose s

1 = dis n , s 2 = dj Ls n , t = n n -1 for some s ≥ 1, which gives P | b i , b j | ≥ Cd i s d j Ls n + d j n + Ldd j n ≤ C e -c min{di,di}s + C e -c n L + C e -c d for appropriately chosen constants. Note that if  d i = 0 or d j = 0 then | b i , b j | is P | b i , b j | ≥ Cd i s 2d j Ls n + Ldd j n ≤ C e -c max{dmin,1}s + C e -c n L + C e -c d . Recalling the definition of d in the lemma statement, an additional union bound over the values of i, j in the expression above combined with (D.119) gives P 1 E δK β HJ -β H (x) 2 2 > Cs d 1 + 2 d 2 1 Ls n + Ld n d 1 d 1/2 1/2 ≤ P   1 E δK β HJ -β H (x) 2 2 > Cs L i,j= +1,i =j d i 2d j Ls n + Ldd j n + Cs L i= +1 d i   ≤ L 2 C e -c max{dmin,1}s + C e -c n L + C e -c d ≤ C e -c max{dmin,1}s + C e -c n L + C e -c d C e -c s + C e -c n L + C e -c d (D. 132) for appropriate constants, where we assumed d ≥ K log L, s ≥ max{1, K log L}, n ≥ K L log L for some K, K , K . Taking a square root gives a bound on the first term in (D.114). We next consider a different bound for this term that will be useful when the {d i } are large. Our starting point will be D.131. If we set s 1 = s, s 2 = Ls and use (D.118) we obtain P 1 E δK β HJ -β H (x) 2 2 > CLns L 2 s + d 1 n + √ Ldt n d 1/2 1/2 + C t n L i= +1 s i d i ≤ P   1 E δK β HJ -β H (x) 2 2 > Cns L i,j= +1,i =j Ls + d j n + Ldd j t n + C t n L i= +1 s i d i   ≤ L 2 C exp -c min{ n 2 s 2 d ∞ , ns} + C e -c n L + C e -c d + L i= +1 e -csi max{di,1} + e -c t ≤ e -cns + e -c n L + e -c d + L i= +1 e -c si max{di,1} + e -c t (D.133) for appropriate constants under similar assumptions on n, L, d. To bound the remaining terms in (D.114), since 1 E δK β J -β HJ 2 ≤ 1 E δK W L+1 Γ L: +1 J -Γ L: +1 HJ 2 H +1 d = 1 E δK Γ L: +1 J -Γ L: +1 HJ W L+1 * 2 H +1 and we can apply lemma D.14 and an ε-net argument to bound the first and second factors respectively, to conclude P 1 E δK β J -β HJ 2 > C √ dL ≤ C e -cd + Ce c n ≤ e -cd for some d such that d ≥ K log L and assuming n > K d. 
An identical result holds for the last term in (D.114) where we simply choose J i = I i (x) for all i. In conclusion, using (D.132) we have P 1 E δK β J -β (x) 2 > C √ dL + C s d 1 + 2 d 2 1 Ls n + Ld n d 1 d 1/2 1/2 ≤ C e -cs + C e -c n L + C e -c d (D.134) for appropriate constants, while if we use (D.133) instead we obtain P 1 E δK β J -β (x) 2 > C √ dL + C Lns L 2 s + d 1 n + √ Ldt n d 1 2 1 2 + CLt n s, d ≤ e -cs + e -c n L + e -c d + L i= +1 e -c si max{di,1} + e -c t . (D.135) It remains to transfer control from β HJ -β H (x) 2 to β HJ -β H (x) 2 . Note that β HJ -β H (x) 2 2 = H +1 * β HJ -β H (x) 2 2 d = P S ⊥ W +1 * (:,1) 2 2 β HJ -β H (x) 2 2 where if = 0 we define P S 0⊥ = I n0×n0 . Since E W +1 P S ⊥ W +1 (:,1) 2 2 = 2 n tr [P S ⊥ ] = 2(n-1) n , Bernstein's inequality gives (assuming n > K for some K) P P S ⊥ W +1 * (:,1) 2 2 < 1 C ≤ e -cn , and hence with the same probability β HJ -β H (x) 2 2 ≤ β HJ -β H (x) 2 2 P S ⊥ W +1 * (:,1) 2 2 ≤ C β HJ -β H (x) 2 2 . The bounds (D.134), (D.135) also apply to β HJ -β H (x) 2 up to a constant factor, with the same probability up to a e -cn term which we can absorb into the existing tail by demanding n ≥ KL for some K. Lemma D.22. For any + 1 < m ≤ j + 1, k ∈ [n], if n ≥ KL for some K then P 1 E δK H j+1 (k,:) Γ j:m HJ 2 > C ≤ e -c n L and if m = + 1 P 1 E δK H j+1 (k,:) Γ j: +1 HJ 2 > C 1 n t ≤ e -c n L + e -c t assuming t ≥ K n -1 for some K . Proof. If j + 1 > m, 1 E δK H j+1 (1,:) Γ j:m HJ 2 =1 E δK W j+1 (1,:) P S j⊥ Γ j:m+1 HJ P Jm H m 2 d =1 E δK W j+1 (1,:) P S j⊥ Γ j:m+1 HJ P Jm W m P S m-1⊥ 2 ≤1 E δK W j+1 (1,:) P S j⊥ Γ j:m+1 HJ P Jm W m 2 d =1 E δK W j+1 (1,:) P S j⊥ Γ j:m+1 HJ P Jm 2 W m (1,:) 2 ≤1 E δK P S j⊥ Γ j:m+1 HJ W j+1 * (1,:) 2 W m (1,:) 2 d = P S j⊥ Γ j:m+1 HJ e 1 2 W j+1 * (1,:) 2 W m (1,:) 2 . If on the other hand j + 1 = m, we have 1 E δK H j+1 (1,:) Γ j:m HJ 2 = 1 E δK H j+1 (1,:) 2 ≤ W j+1 (1,:) 2 . 
Bernstein's inequality gives P W j+1 * (1,:) 2 > C ≤ C e -cn and P W +1 (1,:) 2 > C 1 n t ≤ 2e -ct for t ≥ 1, while lemma D.14 gives P 1 E δK Γ j:m+1 HJ e 1 2 > C ≤ C e -c n L . Taking union bounds gives the desired results. Lemma D.23 (Generalized backward features inner product concentration). Fix x, x ∈ S n0-1 , ν = ∠(x, x ). Define a collection of support sets J , generalized backward features β J , a constant δ s and event E δK as in lemma D.14. Assuming n ≥ max {KL log n, K }, d ≥ K log n for suitably chosen K, K , K , we have P     ∃ ∈ [L] : 1 E δK β J , β J -n 2 L-1 = 1 -ϕ ( ) (ν) π > C d 2 √ Ln + dδ s Ln + d 3/2 δ s 1 + δs √ n L 5/2     ≤ C e -cd for absolute constants c, C, C . If additionally we have P [E δK ] ≥ 1 -e -c d then the same result holds without the truncation on 1 E δK , with worse constants. Proof. Note that 1 E δK β J , β J - n 2 L-1 = 1 - ϕ ( ) (ν) π ≤ 1 E δK β J -β (x), β J + 1 E δK β (x), β J -β (x ) + 1 E δK β (x), β (x ) - n 2 L-1 = 1 - ϕ ( ) (ν) π ≤ 1 E δK β J (x ) 2 β J -β (x) 2 + 1 E δK β (x) 2 β J -β (x ) 2 + 1 E δK β (x), β (x ) -n 2 L-1 = 1 -ϕ ( ) (ν) π . (D.136) In order to bound the first two terms, we use rotational invariance of the Gaussian distribution twice to obtain 1 E δK β J 2 2 = 1 E δK W L+1 Γ L: +1 J (x ) P J (x ) 2 2 ≤ a.s. 1 E δK W L+1 Γ L: +1 J 2 2 d = 1 E δK Γ L: +1 J W L+1 2 2 d = W L+1 2 2 1 E δK Γ L: +1 J e 1 2 2 . Bernstein's inequality gives P W L+1 2 2 > Cn ≤ 2e -cn , while 1 E δK Γ L: +1 J e 1 2 2 can be bounded using D.14 to give P 1 E δK β J 2 2 > Cn ≤ C e -cn + C e -c n L ≤ C e -c n L for appropriate constants. Using lemma D.21 to bound 1 E δK β J -β (x) 2 we obtain P 1 E δK β J (x ) 2 β J -β (x) 2 > C √ dLn + C dδ s Ln + d 3/2 δ s 1 + δ s √ n L 5/2 ≤C e -c n L + C e -c d ≤ C e -c d for some constants, assuming n ≥ KLd for some K. 
Bounding the second term in (D.136) in an identical fashion and the last term in (D.136) using Lemma D.4 we obtain P     1 E δK β J , β J -n 2 L-1 = 1 -ϕ ( ) (ν) π > C d 2 √ Ln + dδ s Ln + d 3/2 δ s 1 + δs √ n L 5/2     ≤ P     1 E δK β J , β J -n 2 L-1 = 1 -ϕ ( ) (ν) π > C √ dLn + dδ s Ln + d 3/2 δ s 1 + δs √ n L 5/2 + C d 2 √ Ln     ≤ P   1 E δK β J 2 β J -β (x) 2 + 1 E δK β J 2 β J -β (x ) 2 > C √ dLn + dδ s Ln + d 3/2 δ s 1 + δs √ n L 5/2   +P 1 E δK β J , β J -n 2 L-1 = 1 -ϕ ( ) (ν) π > C d 2 √ Ln ≤ C e -cd + C e -c d ≤ C e -c d . for appropriate constants assuming d ≥ 1. Taking a union bound over all possible choices of ∈ [L] and using d ≥ K log L for some K gives the desired result. If we additionally have P [E δK ] ≥ 1 -e -c d for some c , we can write β J , β J - n 2 L-1 = 1 - ϕ ( ) (ν) π = 1 E δK β J , β J - n 2 L-1 = 1 - ϕ ( ) (ν) π + (1 -1 E δK ) β J , β J and since the last term is zero w.p. ≥ 1 -e -c d we obtain the same result as in the truncated case, with possibly worse constants.
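The functions ϕ (ℓ) appearing in the inner-product formula above track the angle between two inputs after ℓ random ReLU layers. For the standard ReLU (arccosine) kernel the one-step map has the closed form ϕ(ν) = arccos(((π − ν) cos ν + sin ν)/π), which we assume here purely for illustration; iterating it exhibits the ~1/ℓ angle contraction that the estimates of the next subsection quantify. The initial angle and iteration count below are our choices.

```python
import numpy as np

def phi(nu):
    # One-step ReLU angle map (arccosine-kernel closed form, assumed
    # here for illustration); clip guards against rounding outside [-1, 1].
    c = ((np.pi - nu) * np.cos(nu) + np.sin(nu)) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

nu = np.pi / 2
angles = [nu]
for _ in range(2000):
    nu = phi(nu)
    angles.append(nu)

# Angles decrease monotonically and decay like ~1/l: l * nu_l stays
# bounded, consistent with the 1/(1 + c*l*nu) estimates on the iterated map.
print(angles[100], 2000 * angles[2000])
```

This is the quantitative sense in which depth acts as a fitting resource: deeper networks drive initially well-separated angles toward collapse at a controlled 1/ℓ rate, which is exactly what the martingale analysis of ν ℓ in Lemmas D.24 and D.25 below tracks at finite width.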

D.4 AUXILIARY RESULTS

Lemma D.24. There are absolute constants c 1 , C, C > 0 and absolute constants K, K > 0 such that for any L ∈ N, if n ≥ max{K log 4 n, K L}, then for every ∈ [L] one has E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≤ C log n n ν -1 1 + (c 0 /64)(L -)ν -1 (1 + log L) + C n 2 . The constant c 1 is the absolute constant appearing in Lemma E.1. Proof. The case of = L follows immediately from Lemma E.1 with an appropriate choice of d ≥ K for K > 0 some absolute constant. Henceforth we assume ∈ [L -1]. We Taylor expand (with Lagrange remainder) the smooth function ϕ (L-) about the point ϕ(ν -1 ), obtaining for any t ∈ [0, π] ϕ (L-) (t) = ϕ (L-+1) (ν -1 ) + φ(L-) (ϕ(ν -1 )) t -ϕ(ν -1 ) + φ(L-) (ξ) 2 t -ϕ(ν -1 ) 2 , where ξ is some point of [0, π] lying in between t and ϕ(ν -1 ). In particular, putting t = ν , we obtain ϕ (L-) (ν )-ϕ (L-+1) (ν -1 ) = φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) + φ(L-) (ξ) 2 ν -ϕ(ν -1 ) 2 , where ξ is some point of [0, π] lying in between ν and ϕ(ν -1 ). By (C.23) and (C.26), we have that φ(L-) ≤ 0, whence ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) . (D.137) Using Lemma E.5 and an induction, we have that φ(L-) is decreasing, and moreover by the concavity property we have ϕ(ν -1 ) ≥ ν -1 /2. An application of Lemmas E.1 and C.15 then yields E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≤ C ν -1 log n n + C n -c1d 1 1 + (c 0 /4)(L -)ν -1 ≤ C log n n ν -1 1 + (c 0 /4)(L -)ν -1 + C n -c1d , as long as d ≥ K and n ≥ K d 4 log 4 n. In particular, we can choose d = max{K, 2/c 1 } to obtained the claimed error for the upper bound. Next, for the lower bound, we make use of the estimate φ(L-) (ν) ≥ - C 1 + (c 0 /8)(L -)ν 1 + 1 (c 0 /8)ν log (1 + (c 0 /8)(L --1)ν) f (ν) , which follows from Lemma C.16 and φ(L-) ≤ 0; by that lemma, we have that f is increasing. By Lemma E.3, as long as n ≥ K log 4 n, there is an event E on which |ν -ϕ(ν -1 )| ≤ C ν -1 log n/n + C n -3 and which satisfies P[E | F -1 ] ≥ 1 -C n -3 . 
In particular, on the event E we have ν ≥ ν -1 /4 -C /n 3 provided n ≥ 16C 2 log n, and so on the event E we have ξ ≥ min{ϕ(ν -1 ), ν -1 /4 -C /n 3 } ≥ ν -1 /4 -C /n 3 . We can thus write ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≥ φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) + f (ξ) 2 ν -ϕ(ν -1 ) 2 = φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) + (1 E + 1 E c ) f (ξ) 2 ν -ϕ(ν -1 ) 2 ≥ φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) + 1 E f ( ν -1 4 -C4 n 3 ) 2 ν -ϕ(ν -1 ) 2 -(2C π 2 L)1 E c ≥ φ(L-) (ϕ(ν -1 )) ν -ϕ(ν -1 ) + f ( ν -1 4 -C4 n 3 ) 2 ν -ϕ(ν -1 ) 2 -(2C π 2 L)1 E c where the inequality in the third line follows from boundedness of the angles and the magnitude estimate on f in Lemma C.16, together with our estimate on ξ on E, and the inequality in the final line is a consequence of f ≤ 0, which allows us to drop the indicator for E and obtain a lower bound. Taking conditional expectations using the previous lower bound and applying F -1 -measurability of ν -1 and boundedness of the angles together with our conditional measure bound on E c , we obtain E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≥ -C log n n ν -1 1 + (c 0 /4)(L -)ν -1 - C n 2 - C 5 L n 3 + f ( ν -1 4 -C4 n 3 ) 2 E ν -ϕ(ν -1 ) 2 F -1 , where we also apply the complementary bound obtained by our previous work following (D.137). Since the CL estimate in Lemma C.16 applies also to f , and since f ≤ 0, an application of Lemma E.4 with an appropriate choice of d and the choice n ≥ K log 4 n then yields (with a larger absolute constant C ) E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≥ -C log n n ν -1 1 + (c 0 /4)(L -)ν -1 - C n 2 - C 6 L n 3 + C 7 log n n ν -1 2 f ν -1 4 - C 4 n 3 . If we choose n ≥ (C 6 /C )L, we can simplify this last estimate to E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≥ -C log n n ν -1 1 + (c 0 /4)(L -)ν -1 - 2C n 2 + C 7 log n n ν -1 2 f ν -1 4 - C 4 n 3 . 
To conclude, we divide our analysis into two cases: when ν -1 ≥ 8C 4 /n 3 , we have ν -1 /4 -C 4 n -3 ≥ ν -1 /8, and so ν -1 2 f ν -1 4 - C 4 n 3 ≥ ν -1 2 f ν -1 8 = 64 ν -1 8 2 f ν -1 8 ≥ - 8Cπν -1 1 + (c 0 /64)(L -)ν -1 1 + 8 log(L -) c 0 π , where the last inequality follows from Lemma C.16. On the other hand, when ν -1 ≤ 8C 4 /n 3 , the CL estimate in Lemma C.16 implies ν -1 2 f ν -1 4 - C 4 n 3 ≥ - 64CC 2 4 L n 6 ≥ - 64CC 2 4 L n 3 ≥ - 2C n 2 , where the last estimate holds when n ≥ (32CC 2 4 /C )L. Adding these two estimates together, we obtain one that is valid regardless of the value of ν -1 , and choosing n ≥ C 7 log n to combine the residuals, we obtain (after worst-case adjusting the constants) E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≥ - 4C n 2 - C 8 log n n ν -1 (1 + log(L -)) 1 + (c 0 /64)(L -)ν -1 . Combining with our previous work, we obtain E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) F -1 ≤ C log n n ν -1 (1 + log L) 1 + (c 0 /64)(L -)ν -1 + C 1 n 2 after worst-casing constants. Lemma D.25. There are absolute constants c 1 , C, C , C , C > 0 and absolute constants K, K > 0 such that for any d ≥ K, if n ≥ K d 4 log 4 n, then for every L ∈ N and every ∈ [L] one has P ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ d log n n 2C ν -1 1 + (c 0 /8)(L -)ν -1 + 2C n -c1d/2 F -1 ≥ 1 -C n -c1d , and E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 F -1 ≤ 4C 2 d log n n ν -1 1 + (c 0 /8)(L -)ν -1 2 + C n -c1d/2 . The constant c 1 is the absolute constant appearing in Lemma E.1. Proof. We will fix the meaning of the absolute constants C, C , C > 0 throughout the proof below. By Lemma E.3, we have if d ≥ K and n ≥ K d 4 log 4 n that for every ∈ [L] P ν -ϕ(ν -1 ) ≤ C ν -1 d log n n + C n -c1d F -1 ≥ 1 -C n -c1d . (D.138) By Lemma C.15, we have the estimate φ( ) (t) ≤ 1 1 + (c 0 /2) t , valid for any ∈ N 0 . 
Writing Ξ = ν -ϕ(ν -1 ) so that ν = ϕ(ν -1 ) + Ξ , we have that (Ξ ) is adapted to (F ), and by the fundamental theorem of calculus ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) = ϕ(ν -1 ) ϕ(ν -1 )+Ξ dt 1 + (c 0 /2)(L -)t . The integrand is nonnegative, so by Jensen's inequality we have ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 ≤ |Ξ | ϕ(ν -1 ) ϕ(ν -1 )+Ξ dt (1 + (c 0 /2)(L -)t) 2 , and an integration then yields In particular, the last condition guarantees 2C d log n/n ≤ 1/4. By concavity of ϕ via Lemma E.5, we have ϕ(ν -1 ) ≥ ν -1 /2, and using (D.138) to obtain ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 ≤ Ξ 2 |1 + (c 0 /2)(L -)ϕ(ν -1 )||1 + (c 0 /2)(L -)(ϕ(ν -1 ) + Ξ )| . (D.139) Choosing d ≥ 1/c 1 , P Ξ ≤ C ν -1 d log n n + C n -c1d F -1 ≥ 1 -C n -c1d , we have by (D.140) and (D.141) as well as the concavity lower bound on ϕ P ϕ(ν -1 ) + Ξ ≥ ν -1 /4 F -1 ≥ 1 -C n -c1d as long as ν -1 ≥ (C /C)n -c1d/2 . In particular, plugging these bounds into (D.139) and taking square roots, we obtain by a union bound P ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ 2C d log n n ν -1 1 + (c 0 /8)(L -)ν -1 F -1 ≥ 1 -2C n -cd whenever ν -1 ≥ (C /C)n -c1d/2 . Meanwhile, when ν -1 ≤ (C /C)n -c1d/2 , if we choose n ≥ d log n we have C ν -1 d log n n + C n -c1d ≤ 2C n -c1d/2 , and we can use the 1-Lipschitz property of ϕ (L-) , which follows from Lemma E.5, to obtain using (D.138) P ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ 2C n -c1d/2 F -1 ≥ P ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ C ν -1 d log n n + C n -c1d F -1 ≥ P ν -ϕ(ν -1 ) ≤ C ν -1 d log n n + C n -c1d F -1 ≥ 1 -C n -c1d . Because |ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 )| ≥ 0, we can then obtain using a union bound P ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ d log n n 2C ν -1 1 + (c 0 /8)(L -)ν -1 + 2C n c1d/2 F -1 ≥ 1 -3C n -c1d , which holds regardless of the value of ν -1 . 
We can then obtain the second bound using this one, via a partition of the expectation: let E = ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) ≤ d log n n 2C ν -1 1 + (c 0 /8)(L -)ν -1 + 2C n -c1d/2 , so that E ∈ F , and P[E | F -1 ] ≥ 1 -3C n -c1d by our work above. Then we have E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 F -1 ≤ E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 1 E F -1 + π 2 E 1 E c F -1 ≤ E   2C d log n n ν -1 1 + (c 0 /8)(L -)ν -1 + 2C n -c1d/2 2 F -1   + C n -c1d ≤ 2C d log n n ν -1 1 + (c 0 /8)(L -)ν -1 + 2C n -c1d/2 2 + C n -c1d where the first inequality uses the triangle inequality to obtain ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 ≤ π 2 ; the second inequality applies the definition of E, uses nonnegativity to drop the indicator in the first term, and applies the conditional measure bound on E c ; and the final inequality integrates. Using the fact that our previous choices of large n imply 2C d log n/n ≤ 1/4, and that | ν -1 1+(c0/8)(L-)ν -1 | ≤ π, we can distribute the square in this final bound and worst-case constants to obtain E ϕ (L-) (ν ) -ϕ (L-+1) (ν -1 ) 2 F -1 ≤ 4C 2 d log n n ν -1 1 + (c 0 /8)(L -)ν -1 2 + C n -c1d/2 , as claimed. Lemma D.26. Let X 1 , . . . , X L be independent chi-squared random variables, having respectively d 1 , . . . , d L degrees of freedom. Write d min = min i∈L d i and let ξ i = 1 di X i . Then there are absolute constants c, C > 0 and an absolute constant 0 < K ≤ 1 4 such that for any 0 < t ≤ K, one has P -1 + L i=1 ξ i > t ≤ CLe -cdmint 2 /L . In particular, there are absolute constants C , C > 0 and an absolute constant K > 0 such that for any d > 0, if d min ≥ K dL then one has P -1 + L i=1 ξ i > C dL d min ≤ C Le -d . Proof. For any t ≥ 0, we have by the AM-GM inequality P L i=1 ξ i > 1 + t = P   L i=1 ξ i 1/L > (1 + t) 1/L   ≤ P 1 L L i=1 ξ i > (1 + t) 1/L . By convexity of the exponential, we have (1 + t) 1/L ≥ 1 + 1 L log(1 + t), and by concavity of the logarithm we have log(1 + t) ≥ t log 2 if t ≤ 1. 
This implies P L i=1 ξ i > 1 + t ≤ P -L + L i=1 ξ i > Kt, , where K = log(2). Decomposing each X i into a sum of d i i.i.d. squared gaussians and applying Lemma G.2, we obtain P -L + L i=1 ξ i > t ≤ 2 exp -c min t 2 L i=1 C 2 di , t C max i 1 di ≤ 2 exp -c d min min t 2 /CL, t ≤ 2 exp -c d min t 2 L , (D.142) where the last inequality holds provided t ≤ CL, where C > 0 is an absolute constant. Thus, as long as t ≤ CL/K, we have suitable control of the upper tail of the product i ξ i . For the lower tail, writing log(0) = -∞, we have for any 0 ≤ t < 1 P L i=1 ξ i < 1 -t = P L i=1 log ξ i < log(1 -t) ≤ P L i=1 log ξ i < -t , where the inequality uses concavity of t → log(1 -t). By Lemma G.2, we have for each i ∈ [L] and every 0 ≤ t ≤ C (where C > 0 is an absolute constant) P[|ξ i -1| < t] ≤ 2e -cdit 2 , so that by a union bound and for t ≤ C √ L, we have with probability at least 1 -2Le -cdmint 2 /L that 1 -t/ √ L ≤ ξ i ≤ 1 + t/ √ L for every i ∈ [L]. Meanwhile, Taylor expansion of the smooth function x → log x in a neighborhood of 1 gives log x = (x -1) - 1 2k 2 (x -1) 2 , where k is a number lying between 1 and x. In particular, if x ≥ 1 2 we have log x ≥ (x -1) -2(x - 1) 2 , whence for t ≤ min{C √ L, 1 2 } P L i=1 ξ i < 1 -t ≤ 2Le -cdmint 2 /L + P -L + L i=1 ξ i < -t + 2t 2 ≤ 2Le -cdmint 2 /L + P -L + L i=1 ξ i < - t 2 , where the final inequality requires in addition t ≤ 1 4 . An application of (D.142) then yields the claimed lower tail provided t ≤ CL, which establishes the first claim. For the second claim, we consider the choice t = dL/cd min , for which we have t ≤ K whenever d min ≥ dL/cK 2 , and cd min t 2 /L = d. Lemma D.27. Let X 1 , . . . , X L be independent Binom(n, 1 2 ) random variables, and write ξ i = 2 n X i . Then for any 0 < t ≤ 1 4 , one has P -1 + L i=1 ξ i > t ≤ 4Le -nt 2 /8L . In particular, for any d > 0, if n ≥ 128dL then one has P -1 + L i=1 ξ i > 4 dL n ≤ 4Le -d . Proof. The proof is very similar to that of Lemma D.26. 
For any t ≥ 0, we have by the AM-GM inequality P L i=1 ξ i > 1 + t = P   L i=1 ξ i 1/L > (1 + t) 1/L   ≤ P 1 L L i=1 ξ i > (1 + t) 1/L . By convexity of the exponential, we have (1 + t) 1/L ≥ 1 + 1 L log(1 + t), and by concavity of the logarithm we have log(1 + t) ≥ t log 2 if t ≤ 1. This implies P L i=1 ξ i > 1 + t ≤ P -L + L i=1 ξ i > Kt, , where K = log(2). Decomposing each X i into a sum of n i.i.d. Bern( 12 ) random variables and applying Lemma G.1 twice, we obtain P -L + L i=1 ξ i > t ≤ 2e -nt 2 /2L . (D.143) This gives suitable control of the upper tail of the product i ξ i . For the lower tail, writing log(0) = -∞, we have for any 0 ≤ t < 1 P L i=1 ξ i < 1 -t = P L i=1 log ξ i < log(1 -t) ≤ P L i=1 log ξ i < -t , where the inequality uses concavity of t → log(1 -t). By Lemma G.1, we have for each i ∈ [L] P[|ξ i -1| < t] ≤ 2e -nt 2 /2 , so that by a union bound, we have that 1-t/ √ L ≤ ξ i ≤ 1+t/ √ L for every i ∈ [L] with probability at least 1 -2Le -nt 2 /2L . Meanwhile, Taylor expansion of the smooth function x → log x in a neighborhood of 1 gives log x = (x -1) - 1 2k 2 (x -1) 2 , where k is a number lying between 1 and x. In particular, if x ≥ 1 2 we have log x ≥ (x -1) -2(x -1) 2 , whence for t ≤ 1/2 P L i=1 ξ i < 1 -t ≤ 2Le -nt 2 /2L + P -L + L i=1 ξ i < -t + 2t 2 ≤ 2Le -nt 2 /2L + P -L + L i=1 ξ i < - t 2 , where the final inequality requires in addition t ≤ 1/4. An application of (D.143) then yields the claimed lower tail, which establishes the first claim. For the second claim, we consider the choice t = 8dL/n, for which we have t ≤ 1 4 whenever n ≥ 128dL, and nt 2 /8L = d. Lemma D.28. For 1 ≤ < ≤ L -1 define events Ẽ : B = B -1: xx 2 F ≤ C 2 n( -) ∩ B -1: xx ≤ C( -) ∩ tr B -1: xx ≤ Cn E : B = α -1 (x) 2 α -1 (x ) 2 > 0 ∩    ϕ ( -1) (ν) -ν -1 ≤ C d 3 log 3 n n    ∩ Ẽ : B . If n, L satisfy the assumptions of corollary D.17 then P Ẽ : B ≥ 1 -C n( -) 2 e -c n -. 
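The mechanism shared by Lemmas D.26 and D.27 (a product of $L$ independent normalized averages stays within roughly $\sqrt{L/d_{\min}}$ of $1$) can be visualized with a short Monte Carlo sketch. The function name and parameter choices below are our own, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def product_deviations(L, d, trials=2000):
    """|prod_i xi_i - 1| for xi_i = X_i / d, X_i ~ chi-squared(d) (setting of Lemma D.26)."""
    xi = rng.chisquare(d, size=(trials, L)) / d
    return np.abs(np.prod(xi, axis=1) - 1.0)

# The lemma predicts typical deviations of order sqrt(L / d_min):
for L, d in [(8, 4000), (32, 4000)]:
    dev = product_deviations(L, d)
    print(f"L={L:2d}: median |prod - 1| = {np.median(dev):.4f}, sqrt(L/d) = {np.sqrt(L / d):.4f}")
```

Replacing `rng.chisquare(d, ...) / d` with `2 * rng.binomial(n, 0.5, ...) / n` gives the Binomial setting of Lemma D.27, with the same $\sqrt{L/n}$ scaling.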
If n, L additionally satisfy the conditions of lemmas D.3, E.16 and n ≥ C L log(n) (log(L) + d), then P E : B ≥ 1 -C n -cd . where c, C, C , C are absolute constants. Proof. Since tr B -1: xx = n i=1 e * i Γ -1: +2 (x)P I +1 (x) P I +1 (x ) Γ -1: +2 * (x )e i , applying corollary D.17 2n times gives P   z∈{x,x },i∈[n] Γ -1: +2 (z)e i 2 ≤ √ C   ≥ 1 -C n( -) 2 e -c n - ⇒ P tr B -1: xx ≤ Cn ≥ 1 -C n( -) 2 e -c n -. With the same probability we also have max z∈{x,x } Γ -1: +2 (z) 2 F = max z∈{x,x } tr Γ -1: +2 * (z)Γ -1: +2 (z) ≤ Cn and P max z∈{x,x } Γ -1: +2 (z) ≤ C( -) ≥ 1 -C ( -) 3 e -c n - from which it follows that P B -1: xx ≤ C( -) ≥ 1 -C ( -) 3 e -c n - and P B -1: xx 2 F ≤ C 2 n( -) ≥ P max z∈{x,x } Γ -1: +2 (z) 2 max z∈{x,x } Γ -1: +2 (x) 2 F ≤ C 2 n( -) ≥ 1 -C ( -) 3 e -c n - -C n( -) 2 e -c n - ≥ 1 -C n( -) 2 e -c n -. It follows that Ẽ : B holds with the same probability. From lemma E.16, P α -1 (x) 2 α -1 (x ) 2 > 0 ∩ ν -1 = ν -1 ≥ 1 -C e -cn for some constants c, C . Here ν -1 is the auxiliary angle process defined in (D.2). Using D.3, we obtain P   ϕ ( -1) (ν) -ν -1 ≤ C d 3 log 3 n n   ≥ P   ϕ ( -1) (ν) -ν -1 ≤ C d 3 log 3 n n E   +P [E c ] ≥ 1 -C e -cn -C n -cd ≥ 1 -C n -cd for an appropriate choice of c, C . We conclude that P E : B ≥ 1 -C e -c n -C n 2 e -c n - -C n -c d ≥ 1 -Cn -cd for appropriately chosen constants, where we used n ≥ C log(n) (log( ) + d). Lemma D.29. For ∆ defined in (D.34) and E : B defined in lemma D.28 we have P 1 E : B ∆ > C √ d F -1 ≤ a.s. C e -cd . for some constants c, C, C . Proof. xx 1 E : B ∆ =1 E : B (∆ -E∆ |F -1 ) =1 E : B L-1 i= 1 - ϕ (i) (ν) π tr B : xx -E W tr B : W * P xx W -E W W * P xx W Defining S = span{α (x), α (x )} we decompose W into a sum of two independent terms as W = W P S -1 + W P S -1⊥ ≡ G + H . Note that each H is independent of every other random variable in the problem conditioned on the features. 
1 E : B tr B : xx thus decomposes into a sum of four terms, which we proceed to consider individually and show that they concentrate. The all G term is 1 E : B tr B -1: xx G * P xx G = 1 E : B dim S -1 i,j=1 u -1 * j B -1: xx u -1 i u -1 * i W * P xx W u -1 j where {u -1 i } is an orthonormal basis of S -1 . If α -1 (x) = α -1 (x ) we choose (u -1 1 , u -1 2 ) = ( α -1 (x) α -1 (x) 2 , P α -1 (x)⊥ α -1 (x ) P α -1 (x)⊥ α -1 (x ) 2 ), (which are well-defined on E : B ).Using rotational invariance of the Gaussian distribution, we have u * i W -1 * P -1+1 xx W -1 u j d = u -1 * i R * W -1 * P W -1 Rα -1 (x )>0 P W -1 Rα -1 (x)>0 W -1 Ru -1 j d = P g1 cos ν -1 +g2 sin ν -1 >0 g i , P g1>0 g j where g i ∼ iid N (0, 2 n I). If α -1 (x) = α -1 (x ) then dim S -1 = 1 and we simply choose u -1 1 = α -1 (x) α -1 (x) 2 and end up with an identical expression, with ν -1 = 0. Since P g1>0 g j and P g1 cos ν -1 +g2 sin ν -1 >0 g i are vectors of independent sub-Gaussian random variables with sub-Gaussian norm bounded by C n , their inner product is a sum of independent sub-exponential variables with sub-exponential norm bounded by C n for some constant C. Momentarily abbreviating ṽ = g 1 cos ν -1 + g 2 sin ν -1 , Bernstein's inequality then gives P P ṽ>0 g i , P g1>0 g j -E g1,g2 P ṽ>0 g i , P g1>0 g j > d n F -1 ≤ a.s. 2e -cd (D.144) for some constant c. Since on E : B , B -1: xx ≤ C , we obtain 1 E : B tr B -1: xx G * P xx G ≤ C dim S -1 i,j=1 u -1 * i W * P xx W u -1 j almost surely and thus P 1 E : B tr B -1: xx G * P xx G > C d n F -1 ≤ a.s. 2e -cd for some C , and hence xx P   1 E : B tr B -1: xx G * P xx G -E W 1 E : B tr B -1: xx G * P xx G ≤ 2C √ d F -1   ≤ a.s. G * P xx H + tr B -1: xx H * P xx G . 
Considering the first of these (since the second can be treated in an identical fashion), we recall that H is independent of all the other random variables in the problem conditioned on the features, and we thus have 1 E : B tr B -1: xx G * P xx H d =1 E : B tr B -1: xx G * P xx W P S -1⊥ = dim S -1 i=1 1 E : B v * i W w i where W is an independent copy of W , v i = P xx u i , w i = P S -1⊥ B -1: xx u i . Hence conditioned on all the other variables, v * i W w i is a zero-mean Gaussian with variance 2 v i 2 2 w i 2 2 n 1 E : B . Again from the bound on B -1: xx implied by E : B , we have 2 v i 2 2 w i 2 2 n 1 E : B ≤ C 2 n almost surely. Noting that E W tr B -1: xx G * P xx H = E G E W tr dim S -1 i=1 v * i W w i = 0, a Gaussian tail bound gives P 1 E : B tr B -1: xx G * P xx H -E W 1 E : B tr B -1: xx G * P xx H > √ F -1 ≤ a.s. 2e -c n . (D.146) The final term in tr B : xx is tr B -1: xx H * P xx H d = tr B -1: xx P S -1⊥ W * P xx W P S -1⊥ . Due to the independence of W from the remaining random variables, this is simply a Gaussian chaos in n 2 variables. The Hanson-Wright inequality (lemma G.4) gives P 1 E : B tr B -1: xx H * P xx H -E W 1 E : B tr B -1: xx H * P xx H ≥ t F -1 ≤ a.s. 2 exp   -cnt min      t 1 E : B P S -1⊥ B -1: xx P S -1⊥ 2 F , 1 E : B P S -1⊥ B -1: xx P S -1⊥         ≤ a.s. 2 exp -c n t min t n , where in the last inequality we used the definition of E : B . It follows that P 1 E : B tr B -1: xx H * P xx H -E W 1 E : B tr B -1: xx H * P xx H > 2 √ d F -1 ≤ a.s. P 1 E : B tr B -1: xx H * P xx H -E H 1 E : B tr B -1: xx H * P xx H > √ d F -1 +P E H 1 E : B tr B -1: xx H * P xx H -E W 1 E : B tr B -1: xx H * P xx H > √ d F -1 ≤ a.s. P 1 E : B tr B -1: xx H * P xx H -E H 1 E : B tr B -1: xx H * P xx H > √ d F -1 +P   1 E : B 2tr P S -1⊥ B -1: xx P S -1⊥ n tr P xx -E H tr P xx > √ d F -1   ≤ a.s. Ce -cd (D.147) where in the last inequality we used (D.144) and the properties of E : B . 
Collecting terms and using (D.145), (D.146), (D.147) we obtain P 1 E : B tr B : xx -E W tr B : xx > C √ d F -1 ≤ a.s. C e -cd (D.148) and hence P 1 E : B ∆ > C √ d F -1 = P 1 E : B L-1 i= 1 - ϕ (i) (ν) π tr B : xx -E W tr B : xx > C √ d F -1 ≤ a.s. C e -cd . (D.149) Lemma D.30. For x ∈ S n0-1 and ∈ [L], denote I (x) = supp(α (x) > 0). If n ≥ K then P min |I (x)| ≥ n 4 ≥ 1 -2LCe -cn and for any 0 ≤ t ≤ 1 P L =1 2 |I (x)| n -1 ≥ t ≤ 2 exp -c n L t 2 where c, c , C, K are absolute constants. Proof. Consider the activations at layer . From lemma E.16, if n ≥ K we have P α -1 (x) 2 > 0 ≥ 1 -Ce -cn . Rotational invariance of the Gaussian distribution gives α (x) = W α -1 (x) + d = α -1 (x) 2 W (:,1) + . It follows that E |I (x)| α -1 (x) 2 > 0 = E n i=1 1 α i (x)>0 α -1 (x) 2 > 0 = E n i=1 1 W (:,1) >0 α -1 (x) 2 > 0 . From the symmetry of the Gaussian distribution E |I (x)| α -1 (x) 2 > 0 = n 2 . Since this variable is a sum of n independent variables taking values in {0, 1}, an application of Bernstein's inequality for bounded random variables (lemma G.3) gives P |I (x)| - n 2 > n 4 ≤ P |I (x)| - n 2 > n 4 α -1 (x) 2 > 0 + P α -1 (x) 2 = 0 ≤ 2 exp -c n 2 /16 n + n/4 + Ce -cn ≤ C e -c n for appropriate constants. A union bound gives P L =1 |I (x)| - n 2 > n 4 ≤ 2LC e -c n from which P min |I (x)| ≥ n 4 ≥ 1 -2LC e -c n follows. To prove the second inequality, we use the AM-GM inequality which gives L =1 2 |I (x)| n 1/L ≤ 2 L =1 |I (x)| nL and hence P L =1 2 |I (x)| n ≥ 1 + t = P   L =1 2 |I (x)| n 1/L ≥ (1 + t) 1/L   ≤ P L =1 2 |I (x)| n ≥ L (1 + t) 1/L Convexity of the exponential gives (1+ t) 1/L ≥ 1 + 1 L log(1 + t) and for t ≤ 1 we have log(1 + t) ≥ t log 2, giving P L =1 2 |I (x)| n ≥ 1 + t ≤ P L =1 2 |I (x)| n -L ≥ t log 2 We note that 2 |I (x)| n d = n i=1 1 E b i where b i = 2 n θ i , θ i ∼ iid Bern( 1 2 ) and E = max i b -1 i = 0 is the event that the features at layer -1 are not identically 0. Since n i=1 1 E b i ≤ n i=1 b i a.s. 
we have P L =1 2 |I (x)| n -L ≥ t log 2 ≤ P L =1 n i=1 b i -L ≥ t log 2 . Since this is a sum of independent bounded random variables, an application of Bernstein's inequality for bounded random variables (lemma G.3) gives P L =1 n i=1 b i -L ≥ t ≤ 2 exp -c t 2 E(b i ) 2 + 2 n t = 2 exp -c n L t 2 for some absolute constant c , where we used L ≥ 1 ≥ t. Hence P L =1 2 |I (x)| n -1 ≥ t ≤ 2 exp -c n L t 2 .
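Both claims of Lemma D.30, namely that every layer keeps at least $n/4$ of its $n$ coordinates active and that the product of the normalized active counts $2|I_\ell(x)|/n$ concentrates around $1$, can be sanity-checked by propagating a point through a randomly initialized ReLU network. The width, depth, input, and seed below are our own illustrative choices, and for simplicity we take the input dimension equal to the width:

```python
import numpy as np

rng = np.random.default_rng(1)

def active_fractions(n, L):
    """Track 2|I_l(x)|/n through a depth-L, width-n ReLU net with 2/n-variance init."""
    a = np.ones(n) / np.sqrt(n)  # a fixed unit-norm input
    fracs = []
    for _ in range(L):
        W = rng.normal(scale=np.sqrt(2.0 / n), size=(n, n))
        pre = W @ a
        fracs.append(2 * np.count_nonzero(pre > 0) / n)
        a = np.maximum(pre, 0.0)  # ReLU features alpha^l(x)
    return np.array(fracs)

f = active_fractions(n=1024, L=20)
print("min fraction |I_l|/n:", f.min() / 2)
print("|prod of (2|I_l|/n) - 1|:", abs(np.prod(f) - 1.0))
```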

E SHARP BOUNDS ON THE ONE-STEP ANGLE PROCESS

In this section, we characterize the process by which angles between features for different pairs of points evolve as they are propagated across one layer of the zero-time network. This section is self-contained, and as such it will occasionally overload notation used elsewhere in the document for different local purposes. In particular, in this section (and only in this section) we will use the notation $\sigma(x) = [x]_+$ for the ReLU, and $\sigma'(g) = 1_{g > 0}$ for its weak derivative.

E.1 DEFINITIONS AND PRELIMINARIES

Let $n \in \mathbb{N}$, with $n \ge 2$. Let $g_1$ and $g_2$ be i.i.d. $\mathcal{N}(0, (2/n)I)$ random vectors; we use $\mu$ to denote the joint law of these random variables. We write $G \in \mathbb{R}^{n \times 2}$ for the matrix with first column $g_1$ and second column $g_2$, and $g^1, \dots, g^n$ for the $n$ rows of $G$. If $S \subset [n]$ is nonempty and $A \in \mathbb{R}^{n \times m}$, we write $A_S \in \mathbb{R}^{|S| \times m}$ to denote the submatrix of $A$ consisting of the rows indexed by $S$ in increasing index order. In such situations $S^c$ will always denote the complement relative to $[n]$. For $0 \le \nu \le 2\pi$, define random variables
\[
v_\nu(g_1, g_2) = \sigma(g_1 \cos\nu + g_2 \sin\nu), \qquad
\dot v_\nu(g_1, g_2) = \sigma'(g_1 \cos\nu + g_2 \sin\nu) \odot (g_2 \cos\nu - g_1 \sin\nu),
\]
where $\odot$ denotes the elementwise product. Because $\dot v_\nu$ separates over coordinates of its arguments and has each of its coordinates the product of a nondecreasing function and a continuous function, it is Borel measurable. A key property that we will use throughout this section is that the joint distribution of $(g_1, g_2)$ is rotationally invariant; in particular, it is invariant to rotations of the type
\[
G \mapsto G \begin{bmatrix} \cos\nu & \sin\nu \\ \sin\nu & -\cos\nu \end{bmatrix}, \qquad \nu \in [0, 2\pi].
\]
Since we can write
\[
v_\nu = \sigma\!\left(G \begin{bmatrix} \cos\nu \\ \sin\nu \end{bmatrix}\right), \qquad
\dot v_\nu = \sigma'\!\left(G \begin{bmatrix} \cos\nu \\ \sin\nu \end{bmatrix}\right) \odot \left(G \begin{bmatrix} -\sin\nu \\ \cos\nu \end{bmatrix}\right),
\]
where all of the $\mathbb{R}^2$ vectors appearing above are elements of $S^1$, it follows by applying rotational invariance and the specific rotation given above that $(v_\nu, \dot v_\nu) \overset{d}{=} (v_0, -\dot v_0)$. This equivalence is useful for evaluating expectations and differentiating with respect to $\nu$. If $0 < c \le 0.5$ and $m \in \mathbb{N}_0$ with $m < n$, define an event
\[
E_{c,m} = \bigcap_{\substack{S \subset [n] \\ |S| = m}} \; \bigcap_{\nu \in [0, 2\pi]} \left\{ (g_1, g_2) \;\middle|\; c \le \left\| I_{S^c}\, v_\nu(g_1, g_2) \right\|_2 \le c^{-1} \right\}.
\]
For each $c, m$, the set $E_{c,m}$ is closed, since $\|A v_\nu\|_2$ is a continuous function of $(g_1, g_2)$ for any linear map $A$. We further define $E_{0,m} = \bigcap_{k \in \mathbb{N}} E_{1/(2k),\, m}$, so that
\[
E_{0,m} = \bigcap_{\substack{S \subset [n] \\ |S| = m}} \; \bigcap_{\nu \in [0, 2\pi]} \left\{ (g_1, g_2) \;\middle|\; 0 < \left\| I_{S^c}\, v_\nu(g_1, g_2) \right\|_2 \right\},
\]
and $E_{0,m}$ is Borel measurable. If $c$ is omitted, we take the constant $c$ in the definition to be $0.5$. On $E_{c,m}$ we guarantee that $\|v_\nu\|_0 \ge m$ uniformly on $[0, \pi]$.
Define a function $X_\nu$ by
\[
X_\nu = 1_{E_1} \left\langle \frac{v_0}{\|v_0\|_2}, \frac{v_\nu}{\|v_\nu\|_2} \right\rangle.
\]
On $E_1$, we guarantee that $v_\nu \ne 0$ for every $\nu$, so $X_\nu$ is well defined; because $E_1$ is Borel measurable, we have that $X_\nu$ is Borel measurable, and moreover $|X_\nu| \le 1$, so $X_\nu \in L^p_\mu$ for every $p \ge 1$. Finally, define for $0 \le \nu \le \pi$
\[
\varphi(\nu) = \mathbb{E}_{g_1, g_2}\left[\cos^{-1} X_\nu\right], \qquad
\phi(\nu) = \cos^{-1} \mathbb{E}_{g_1, g_2}\left[\langle v_0, v_\nu \rangle\right].
\]
4. $\varphi'' < -c < 0$ on $[0, \pi/2]$ for an absolute constant $c > 0$;
5. $0 < \varphi' < 1$ and $0 > \varphi'' \ge -C$ on $(0, \pi)$ for some absolute constant $C > 0$;
6. $\nu(1 - C_1 \nu) \le \phi(\nu) \le \nu(1 - c_1 \nu)$ on $[0, \pi]$ for some absolute constants $C_1, c_1 > 0$.
Proof. Deferred to Appendix E.4.
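For the definitions above, the expectation inside ϕ has a closed form: $\mathbb{E}_{g_1,g_2}[\langle v_0, v_\nu\rangle] = (\sin\nu + (\pi - \nu)\cos\nu)/\pi$, the first-order arc-cosine kernel of Cho & Saul, so that $\phi(\nu) = \cos^{-1}\!\big((\sin\nu + (\pi-\nu)\cos\nu)/\pi\big)$. The sketch below checks this identity against Monte Carlo; the width, trial count, and seed are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4096

def phi_closed_form(nu):
    # Arc-cosine kernel identity: E<v_0, v_nu> = (sin(nu) + (pi - nu) cos(nu)) / pi.
    return np.arccos((np.sin(nu) + (np.pi - nu) * np.cos(nu)) / np.pi)

def phi_monte_carlo(nu, trials=200):
    g1 = rng.normal(scale=np.sqrt(2.0 / n), size=(trials, n))
    g2 = rng.normal(scale=np.sqrt(2.0 / n), size=(trials, n))
    v0 = np.maximum(g1, 0.0)                                  # v_0 = sigma(g1)
    vnu = np.maximum(g1 * np.cos(nu) + g2 * np.sin(nu), 0.0)  # v_nu
    return np.arccos(np.clip(np.mean(np.sum(v0 * vnu, axis=1)), -1.0, 1.0))

for nu in [0.3, 1.0, 2.0]:
    print(f"nu={nu}: closed form {phi_closed_form(nu):.4f}, Monte Carlo {phi_monte_carlo(nu):.4f}")
```

Note that $\phi(\nu) < \nu$ for $\nu \in (0, \pi]$, which is the contraction that property 6 quantifies.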

E.3.1 CORE SUPPORTING RESULTS

Lemma E.6. There exist absolute constants $c, C, C', C'', C''', C'''' > 0$ and an absolute constant $K > 0$ such that for any $d \ge 1$, if $n$ and $d$ satisfy the hypotheses of Lemmas E.9 and E.10 and moreover $n \ge K d \log n$, then one has
\[
\left| \mathbb{E}_{g_1,g_2}\left[\cos^{-1} X_\nu\right] - \cos^{-1} \mathbb{E}_{g_1,g_2}[X_\nu] \right| \le C \nu \frac{\log n}{n} + C' n^{-cd},
\]
and with probability at least $1 - C'' n^{-cd}$, one has
\[
\left| \cos^{-1} X_\nu - \mathbb{E}\left[\cos^{-1} X_\nu\right] \right| \le C''' \nu \sqrt{\frac{d \log n}{n}} + C'''' n^{-cd}.
\]
Proof. Fix $\nu \in [0, \pi]$. The function $\cos^{-1}$ is smooth on $(-\delta, 1)$ for any $0 < \delta < 1$, so Taylor expansion about $\mathbb{E}[X_\nu]$ (which satisfies $0 \le \mathbb{E}[X_\nu] < 1$ if $\nu > 0$; we will handle $\nu = 0$ separately below) gives
\[
\cos^{-1}(x) = \cos^{-1} \mathbb{E}[X_\nu] - \frac{1}{\sqrt{1 - \mathbb{E}[X_\nu]^2}}\left(x - \mathbb{E}[X_\nu]\right) - \frac{\xi}{2\left(1 - \xi^2\right)^{3/2}}\left(x - \mathbb{E}[X_\nu]\right)^2,
\]
where $\xi$ lies between $x$ and $\mathbb{E}[X_\nu]$. Using the fact that $X_\nu \ne 1$ almost surely if $\nu > 0$, which is established in Lemma E.23, we plug in $x = X_\nu$ to get
\[
\cos^{-1} \mathbb{E}[X_\nu] - \cos^{-1}(X_\nu) = \frac{1}{\sqrt{1 - \mathbb{E}[X_\nu]^2}}\left(X_\nu - \mathbb{E}[X_\nu]\right) + \frac{\xi(X_\nu)}{2\left(1 - \xi(X_\nu)^2\right)^{3/2}}\left(X_\nu - \mathbb{E}[X_\nu]\right)^2,
\tag{E.1}
\]
where we now express $\xi$ as a function of $X_\nu$. From Jensen's inequality it is clear that
\[
\mathbb{E}\left[\cos^{-1} X_\nu\right] \le \cos^{-1} \mathbb{E}[X_\nu],
\tag{E.2}
\]
so all that remains is to obtain a matching upper bound for the right-hand side of (E.1). We will make use of the following facts, proved in subsequent sections: there are absolute constants $C_i > 0$, $i \in [6]$, and $c_i > 0$, $i \in [5]$, such that
1. $\mathbb{E}[X_\nu] \le 1 - c_5 \nu^2 + C_1 e^{-c_1 n}$ (Lemma E.8);
2. for each $\nu$, $\operatorname{Var}[X_\nu] \le C_5 \nu^4 \log n / n + C_2 e^{-c_2 n}$ (Lemma E.9);
3. with probability at least $1 - C_3 n^{-c_3 d}$, one has $|X_\nu - \mathbb{E}[X_\nu]| \le C_6 \nu^2 \sqrt{d \log n / n} + C_4 e^{-c_4 n}$ (Lemma E.10).
Let $E$ denote the event on which property 3 holds. Combining properties 1 and 3, we obtain with probability at least $1 - C_3 n^{-c_3 d}$
\[
X_\nu \le 1 - (c_5/2)\nu^2 + C_1 e^{-c_1 n} + C_4 e^{-c_4 n},
\]
provided $n$ is chosen larger than an absolute constant multiple of $d \log n$. Thus, defining
\[
\nu_0 = \sqrt{\frac{4}{c_5}\left(C_1 e^{-c_1 n} + C_4 e^{-c_4 n}\right)},
\]
we obtain for $\nu \ge \nu_0$
\[
\mathbb{E}[X_\nu] \le 1 - \frac{c_5}{4}\nu^2, \qquad X_\nu \le 1 - \frac{c_5}{4}\nu^2,
\tag{E.3}
\]
with the second bound holding with probability at least $1 - C_3 n^{-c_3 d}$.
Considering first 0 ≤ ν ≤ ν 0 , we obtain using the triangle inequality, Lemma E.20 and property 3 cos -1 E[X ν ] -E cos -1 (X ν ) ≤ E 1 E cos -1 E[X ν ] -cos -1 (X ν ) + E 1 E c cos -1 E[X ν ] -cos -1 (X ν ) ≤ E 1 E |X ν -E[X ν ]| + E[1 E c π/2] ≤ Ce -cn + C n -c d , (E.4) with the final inequality following from the triangle inequality for the 2 norm and the fact that ν ≤ ν 0 . Meanwhile, if ν ≥ ν 0 , we have by (E.3) 0 ≤ ξ(X ν ) ≤ max{X ν , E[X ν ]} ≤ 1 - c 5 4 ν 2 with probability at least 1 -C 3 n -c3d . Using 1 -x 2 = (1 + x)(1 -x) and E[X ν ] ≥ 0, ξ(X ν ) ≥ 0, we have under this condition on ν 1 1 -E[X ν ] 2 ≤ 1 1 -E[X ν ] ≤ 2 c 5 ν (E.5) and similarly ξ(X ν ) 2 (1 -ξ(X ν ) 2 ) 3/2 ≤ 4 c 3 5 ν 3 1 E + π 2 1 E c . (E.6) Applying (E.6) and taking expectations in (E.1), we obtain by property 2 cos -1 E[X ν ] -E[cos -1 X ν ] ≤ Cν log n n + C e -cn + C n -c3d . (E.7) Together, (E.2), (E.4) and (E.7) establish the first claim provided n is chosen larger than an absolute constant multiple of d log n. For the second claim, we begin by using the triangle inequality to write cos -1 X ν -E cos -1 X ν ≤ cos -1 X ν -cos -1 E[X ν ] + cos -1 E[X ν ] -E cos -1 X ν , and then observe that our proof of the first claim implies suitable control of the second term. For the first term, if ν ≤ ν 0 we use Lemma E.20 to immediately obtain with probability at least 1-C 3 n -c3d that this term is at most Ce -cn . For ν ≥ ν 0 , we apply property 3 and the bounds (E.5) and (E.6) in the expression (E.1) to obtain with probability at least 1 -C 3 n -c3d cos -1 X ν -cos -1 E[X ν ] ≤ Cν d log n n + C ν d log n n , which is of the claimed order when n is chosen larger than an absolute constant multiple of d log n. Lemma E.7. There exist absolute constants c, C, C , C > 0 such that if n ≥ C log n, one has ϕ(ν) -cos -1 E g1,g2 [X ν ] ≤ C e -cn + C ν n . Proof. 
Write $f(\nu) = \cos\phi(\nu)$, and $h(\nu) = \mathbb{E}[X_\nu] - f(\nu)$, so that $h$ is the residual between the two terms whose images we are trying to tie together. We will make use of the following results:
1. The function $\cos^{-1}$ is $\frac{1}{2}$-Hölder continuous on $[0, 1]$, so that $|\cos^{-1} x - \cos^{-1} y| \le 2\sqrt{|x - y|}$ if $x, y \ge 0$. (Lemma E.20)
2. For $\nu \in [0, \pi]$, we have $1 - \frac{1}{2}\nu^2 \le f(\nu) \le 1 - c_2 \nu^2$. (Lemma E.14)
3. For all $0 \le \nu \le \pi$, $|h(\nu)| \le C_1 e^{-c_1 n} + C_2 \nu^2 / n$. (Lemma E.15)
We choose $n$ large enough that the hypotheses of Lemma E.15 are satisfied. Define $\nu_0 = 2\sqrt{C_1/c_2}\, e^{-c_1 n / 2}$. We split the analysis into two sub-intervals: $I_1 := [0, \nu_0]$ and $I_2 := [\nu_0, \pi]$. Choosing $n$ larger than an absolute constant multiple of $\log n$, we guarantee that $I_1$ and $I_2$ both have positive measure. On $I_1$, we proceed as follows:
\[
\left|\cos^{-1} f - \cos^{-1}(f + h)\right|
\le 2\sqrt{|h|}
\le 2\sqrt{C_1 e^{-c_1 n} + C_2 \nu^2 / n}
\le 2\sqrt{C_1 e^{-c_1 n} + 4 C_1 C_2 c_2^{-1} e^{-c_1 n}}
\le C e^{-\frac{1}{2} c_1 n}.
\]
The first inequality uses Hölder continuity of $\cos^{-1}$, the second uses our bound on the residual, the third uses the definition of $I_1$, and the fourth worst-cases the constants. On $I_2$, we calculate
\[
|f + h| \le |f| + |h| \le C_1 e^{-c_1 n} + \frac{C_2 \nu^2}{n} + 1 - c_2 \nu^2,
\]
using the triangle inequality and our bounds on $|h|$ and $f$. Using the condition $\nu \ge \nu_0$ and choosing $n \ge 4 C_2 / c_2$, we can rearrange to get
\[
C_1 e^{-c_1 n} + \frac{C_2 \nu^2}{n} \le \frac{c_2 \nu^2}{2},
\]
which implies $|f + h| \le 1 - c_2 \nu^2 / 2$. By the control $f(\nu) \le 1 - c_2 \nu^2$, valid on $I_2$, we get that both $f$ and $f + h$ are bounded above by $1 - c_2 \nu^2 / 2$ on $I_2$; moreover, because $f \ge 0$ and $f + h \ge 0$, we can apply local Lipschitz properties of $\cos^{-1}$ on $I_2$. This yields
\[
\left|\cos^{-1} f - \cos^{-1}(f + h)\right|
\le \frac{|h|}{\sqrt{1 - \sup_{I_2} \max\{f, f + h\}^2}}
\le \frac{C_1 e^{-c_1 n} + C_2 \nu^2 / n}{\sqrt{1 - \left(1 - (c_2/2)\nu^2\right)^2}}
= \frac{C_1 e^{-c_1 n}}{\sqrt{\tfrac{1}{2} c_2 \nu^2 \left(2 - \tfrac{1}{2} c_2 \nu^2\right)}} + \frac{C_2 \nu^2 / n}{\sqrt{\tfrac{1}{2} c_2 \nu^2 \left(2 - \tfrac{1}{2} c_2 \nu^2\right)}}
\le C \nu^{-1} e^{-c_1 n} + C' \nu / n
\le C e^{-\frac{1}{2} c_1 n} + C' \nu / n.
\]
Above, the first inequality is the instantiation of the local Lipschitz property; the second applies our upper and lower bounds on f and f + h derived above, and our bound on the residual |h|; the fourth applies the bound 0 ≤ f (ν) ≤ 1 -1 2 c 2 ν 2 to conclude 2 -1 2 c 2 ν 2 ≥ 1 on I 2 , and cancels a factor of ν in the second term; and in the last line, we apply ν ∈ I 2 to get ν ≥ 2 C 1 /c 2 e -c1n/2 , which allows us to cancel the ν -1 factor in the first term of the previous line. To wrap up, we can choose the largest of the constants appearing in the bounds derived for I 1 and I 2 above and then conclude, since I 1 ∪ I 2 = [0, π] under our condition on n.
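The ½-Hölder continuity of $\cos^{-1}$ invoked on $I_1$ is easy to verify numerically. The constant $2$ below is our own elementary bound, not necessarily the constant of Lemma E.20 (whose statement falls outside this excerpt): for $0 \le x \le y \le 1$, $\cos^{-1} x - \cos^{-1} y = \int_x^y dt/\sqrt{1-t^2} \le \int_x^y dt/\sqrt{1-t} \le 2\sqrt{y - x}$.

```python
import numpy as np

# Grid check of |arccos(x) - arccos(y)| <= 2 sqrt(|x - y|) on [0, 1].
xs = np.linspace(0.0, 1.0, 501)
X, Y = np.meshgrid(xs, xs)
ratio = np.abs(np.arccos(X) - np.arccos(Y)) / np.sqrt(np.abs(X - Y) + 1e-300)
print("max Holder ratio:", ratio.max())  # stays below 2
```

The ratio approaches $\sqrt{2}$ for nearby points at $x = 1$ (so the exponent $\tfrac{1}{2}$ cannot be improved there) and peaks at $\pi/2$ for the pair $(0, 1)$, both below $2$.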

E.3.2 PROVING LEMMA E.6

Lemma E.8. There exist absolute constants c, c , C, C , C > 0 such that if n ≥ C and if n is sufficiently large to satisfy the hypotheses of Lemma E.15, one has 1 -C ν 2 -C e -c n ≤ E g1,g2 [X ν ] ≤ 1 -cν 2 + C e -c n . Proof. By the triangle inequality, we have |cos ϕ(ν)| -|E[X ν ] -cos ϕ(ν)| ≤ E[X ν ] ≤ |cos ϕ(ν)| + |E[X ν ] -cos ϕ(ν)|. Applying Lemmas E.14 and E.15 with m = 0, we get 1 -C ν 2 -Ce -c n -C ν 2 /n ≤ E[X ν ] ≤ 1 -cν 2 + Ce -c n + C ν 2 /n, which proves the claim if we choose n ≥ 2C /c. Lemma E.9. There exist absolute constants c, C, C > 0 such that if n satisfies the hypotheses of Lemmas E.11 and E.12, then one has for each ν ∈ [0, π] Var[X ν ] ≤ Cν 4 log n n + C e -cn . Proof. We use the following elementary fact for a random variable with finite first and second moments, easily proved using Var[X ν ] = E[X 2 ν ] -E[X ν ] 2 and Fubini's theorem: in this setting one has Var[X ν ] = E g1 [Var[X ν (g 1 , • )]] + Var[E g2 [X ν ( • , g 2 )]]. By Lemma E.11, there is an event E of probability at least 1 -Ce -cn on which Var[X ν (g 1 , • )] ≤ C ν 4 /n + C e -c n . Invoking as well Lemma E.12, we obtain Var[X ν ] ≤ E g1 [(1 E + 1 E c )Var[X ν (g 1 , • )]] + C ν 4 log n n + C e -c n ≤ Cν 4 log n n + C e -cn + P[E c ] 1/2 E g1 Var[X ν (g 1 , • )] 2 1/2 ≤ Cν 4 log n n + C e -cn , as claimed, where in the second line we applied nonnegativity of the variance and the Schwarz inequality, and in the third line we used the fact that X L 2 ≤ X L ∞ for any random variable X in L ∞ . Lemma E.10. There exist absolute constants c, c , C, C , C > 0 and absolute constants K, K > 0 such that for any d ≥ 1 such that n and d satisfy the hypotheses of Lemmas E.11 and E.13 and n ≥ max{K log n, K d}, for any ν ∈ [0, π], one has P X ν -E g1,g2 [X ν ] ≤ C ν 2 d log n n + Ce -cn ≥ 1 -C n -c d . Proof. By Lemma E.11, we have P X ν -E g2 [X ν ] ≤ C ν 2 d n + Ce -cn ≥ 1 -C e -c d . 
Let ψ = ψ 0.25 denote the cutoff function defined in Lemma E.31, and write Y ν (g 1 , g 2 ) = v 0 (g 1 , g 2 ) ψ ( v 0 (g 1 , g 2 ) 2 ) , v ν (g 1 , g 2 ) ψ ( v ν (g 1 , g 2 ) 2 ) . By Lemma E.13, we have P E g2 [Y ν ] -E g1,g2 [Y ν ] ≤ C ν 2 d log n n + Cne -cn ≥ 1 -C n -c d We have X ν = Y ν on the event E 1 , by Lemma E.16, and we thus calculate using the triangle inequality E g1,g2 [Y ν ] -E g1,g2 [X ν ] ≤ E g1,g2 [|X ν -Y ν |] = E g1,g2 1 E c 1 Y ν ≤ Cne -cn , where the last inequality uses Hölder's inequality and the measure bound in Lemma E.16. Again using the triangle inequality, we have E g2 [X ν ] -E g2 [Y ν ] ≤ E g2 [|X ν -Y ν |], and so using our previous calculation and Markov's inequality, we can assert P E g2 [Y ν ] -E g2 [X ν ] ≤ Cne -cn/2 ≥ 1 -e -cn/2 . The claim then follows from the triangle inequality, a union bound, and a choice of n larger than an absolute constant multiple of log n and an absolute constant multiple of d. Lemma E.11. There exist absolute constants c, c , c , c , C, C , C , C , C 4 , C 5 > 0, and absolute constants K, K > 0 such that for any d ≥ 1, if n ≥ max{Kd, K log n}, then for every ν ∈ [0, π] one has with probability at least 1 -Ce -cn Var[X ν (g 1 , • )] ≤ C 4 ν 4 n + C e -c n , and with (g 1 , g 2 ) probability at least 1 -C e -c d one has X ν -E g2 [X ν ] ≤ C 5 ν 2 d n + C e -c n . Proof. Fix ν ∈ [0, π]. Let E 1 = E 0. 5,1 denote the event in Lemma E.16 which is in the definition of X ν . We start by treating the case of ν = 0 or ν = π. We have X π = 0 deterministically, so the variance is zero and it equals its partial expectation over g 2 with probability one. For the other case, one has X 0 = 1 E1 ; we have Var[X 0 (g 1 , • )] = E g2 [1 E1 ] -E g2 [1 E1 ] 2 ≤ 1 -E g2 [1 E1 ] , and since E[1 E1 ] = 1 -Cne -cn by Lemma E.16, we obtain by Markov's inequality P Var[X 0 (g 1 , • )] ≥ Cne -cn/2 ≤ e -cn/2 . This gives a suitable bound on the variance with suitable probability. 
For deviations, we note that E X 0 -E g2 [X 0 ] = 0, and following our previous variance inequality but taking expectations over both g 1 and g 2 gives Var[X 0 ] ≤ Cne -cn , which implies by Chebyshev's inequality P X 0 -E g2 [X 0 ] ≥ Cne -cn/2 ≤ e -cn/2 which is a suitable deviations bound that we can union bound with the event constructed below, which controls deviations uniformly for the remaining values of ν. We therefore assume below that 0 < ν < π. Let ψ(x) = max{x, 1 8 }, which is continuous and differentiable except at x = 1 8 , with derivative ψ (x) = 1 x>1/8 . We note in addition that x ≤ ψ(x), and since ψ ≥ 1 8 we have for x ≥ 0 the bound x/ψ(x) ≤ 1. Define Y ν (g 1 , g 2 ) = v 0 (g 1 , g 2 ), v ν (g 1 , g 2 ) ψ( v 0 (g 1 , g 2 ) 2 )ψ( v ν (g 1 , g 2 ) 2 ) . We first show that it is enough to prove the claims for Y ν , which will be preferable for technical reasons. On E 1 , we have Y ν = X ν . We have |Y ν | ≤ 1, and we calculate E g2 (Y ν -X ν ) 2 = E g2 1 E c 1 (Y ν -X ν ) 2 ≤ E g2 1 E c 1 1/2 E g2 (Y ν -X ν ) 4 1/2 ≤ C E g2 1 E c 1 1/2 , where the first inequality uses the Schwarz inequality, and the last inequality uses that |X ν | ≤ 1 and the triangle inequality, and where C > 0 is an absolute constant. We have by Tonelli's theorem and Lemma E.16 E g1 E g2 1 E c 1 1/2 ≤ Cne -cn , so Markov's inequality implies P E g2 1 E c 1 1/2 ≥ Cne -cn/2 ≤ e -cn/2 . Thus, with probability at least 1 -e -cn/2 , we have E g2 (Y ν -X ν ) 2 ≤ C ne -cn/2 , so that an application of Lemma E.32 yields that with probability at least 1 -e -cn/2 Var[X ν (g 1 , • )] ≤ Var[Y ν (g 1 , • )] + C ne -c n , where we have worst-cased constants and the exponent on n. 
For deviations, we write using the triangle inequality X ν -E g2 [X ν ] ≤ |X ν -Y ν | + Y ν -E g2 [Y ν ] + E g2 [Y ν ] -E g2 [X ν ] , and then note that the first term is identically zero on the event E 1 , which has probability at least 1 -Ce -cn , whereas for the third term, we have E g2 [Y ν ] -E g2 [X ν ] ≤ E g2 (Y ν -X ν ) 2 1/2 ≤ C ne -cn/2 , where the first inequality uses the triangle inequality and the Lyapunov inequality, and the second inequality holds with probability at least 1 -e -cn/2 , and leverages the argument we used to control the difference in variances. Ultimately taking union bounds, we can conclude that it sufficient to prove the claimed properties for Y ν . With 0 < ν < π fixed, we introduce the notation u g1 = v 0 (g 1 ); v g1,g2 = v ν (g 1 , g 2 ), so that Y ν = u g1 ψ ( u g1 2 ) , v g1,g2 ψ ( v g1,g2 2 ) . For fixed g 1 , we will write Y ν (g 2 ) = Y ν (g 1 , g 2 ) with an abuse of notation. For ḡ ∈ R n arbitrary and g 2 fixed, we consider the function f (t) = Y ν (g 2 + tḡ) for t ∈ [0, 1]. Writing f for the derivative of f where it exists, at any point of differentiability, we calculate by the chain rule f (t) = ∇ g2 Y ν (g 2 + tḡ), ḡ , where ∇ g2 Y ν (g 2 ) = sin ν ψ( u g1 2 )ψ( v g1,g2 2 ) I - ψ ( v g1,g2 2 )v g1,g2 v * g1,g2 ψ( v g1,g2 2 ) v g1,g2 2 1 vg 1 ,g 2 >0 u g1 . Using the fact that 1 vg 1 ,g 2 >0 u g1 = P {vg 1 ,g 2 >0} u g1 , where P {vg 1 ,g 2 >0} is the orthogonal projection onto the coordinates where v g1,g2 is positive, together with the fact that v g1,g2 v * g1,g2 P {vg 1 ,g 2 >0} = P {vg 1 ,g 2 >0} v g1,g2 v * g1,g2 , we can also write ∇ g2 Y ν (g 2 ) = sin ν ψ( u g1 2 )ψ( v g1,g2 2 ) P {vg 1 ,g 2 >0} I - ψ ( v g1,g2 2 )v g1,g2 v * g1,g2 ψ( v g1,g2 2 ) v g1,g2 2 u g1 . (E.8) We next argue that f does not fail to be differentiable at too many points of [0, 1]. Because ψ > 0, it will suffice to show that (i) t → v g1,g2+tḡ and (ii) t → ψ( v g1,g2+tḡ 2 ) are differentiable at all but a set of isolated points in [0, 1]. 
For the latter function, we note that at any point where v g1,g2+tḡ 2 < 1 8 , by continuity we have that t → ψ( v g1,g2+tḡ 2 ) is locally constant, and therefore differentiable at such points. At other points, by Lemma E.21 it suffices to characterize t → v g1,g2+tḡ 2 as differentiable at all but isolated points, and because v g1,g2+tḡ 2 ≥ 1 8 by assumption, the norm is differentiable and by the chain rule it suffices to characterize differentiability of each coordinate of t → v g1,g2+tḡ , which settles the question of all-but-isolated differentiability of (i) as well. We have v g1,g2+tḡ = σ(g 1 cos ν + g 2 sin ν + tḡ sin ν), so again by Lemma E.21, we conclude from differentiability of t → g 1 cos ν + g 2 sin ν + tḡ sin ν that t → v g1,g2+tḡ is differentiable at all but isolated points, and consequently so is f . In particular, f is differentiable at all but countably many points of [0, 1]. Next, we show that f has suitable integrability properties. Indeed, we calculate using (E.8) ∇ g2 Y ν (g 2 ) 2 ≤ 8ν I - ψ ( v g1,g2 2 )v g1,g2 v * g1,g2 ψ( v g1,g2 2 ) v g1,g2 2 u g1 ψ( u g1 2 ) 2 = 8ν 1 -ψ ( v g1,g2 2 )Y ν (g 2 ) 2 , (E.9) where we used Cauchy-Schwarz and ψ ≥ 1 8 in the first inequality and distributed and applied (ψ ) 2 = ψ and the estimate x/ψ(x) ≤ 1 in the second inequality. In particular, this implies that |f (t)| ≤ C ḡ 2 , which is a t-integrable upper bound for every ḡ. Because Y ν (g 1 , • ) is continuous by continuity of σ, ψ, and the fact that ψ becomes constant whenever v g1,g2 2 < 1 8 , we can apply (Cohn, 2013, Theorem 6.3.11) to get Y ν (g 2 + ḡ) = Y ν (g 2 ) + 1 0 ∇ g2 Y ν (g 2 + tḡ), ḡ dt, and since ḡ was arbitrary, for any g 2 ∈ R n we can put ḡ = g 2 -g 2 to get Y ν (g 2 ) = Y ν (g 2 ) + 1 0 ∇ g2 Y ν (tg 2 + (1 -t)g 2 ), g 2 -g 2 dt. Performing the expansion with g 2 and g 2 reversed and applying the triangle inequality and Cauchy-Schwarz then implies the estimate |Y ν (g 2 ) -Y ν (g 2 )| ≤ g 2 -g 2 2 1 0 ∇ g2 Y ν (tg 2 + (1 -t)g 2 ) 2 dt. 
(E.10) This relation is enough to conclude the result for angles satisfying ν ≥ c 0 , where 0 < c 0 ≤ π/4 is an absolute constant. Indeed, (E.9) and (E.10) imply that Y ν is C-Lipschitz, where C > 0 is an absolute constant; so the Gaussian Poincaré inequality implies E g2 Y ν -E g2 [Y ν ] ≤ C n , and Gauss-Lipschitz concentration implies for any d ≥ 0 P Y ν -E g2 [Y ν ] ≥ C d n ≤ 2e -d . Because ν ≥ c 0 , we can adjust these bounds to involve ν 4 and ν 2 (respectively) while only paying increases in the constant factors. We proceed assuming 0 < ν ≤ c 0 . Let 0 ≤ τ g1 ≤ 1 denote a median of Y ν (g 1 , • ), i.e., a number satisfying Pg 2 [Y ν ≥ τ g1 ] ≥ 1 2 and Pg 2 [Y ν ≤ τ g1 ] ≥ 1 2 , and for each 0 ≤ s < τ g1 define w s (g 2 ) = max{Y ν (g 2 ), τ g1 -s}. For any 0 ≤ s < τ g1 , notice that w s ≥ Y ν , which implies that P[w s ≥ τ g1 ] ≥ P[Y ν ≥ τ g1 ] ≥ 1 2 , because τ g1 is a median of Y ν ; and similarly P[w s ≤ τ g1 ] ≥ P[Y ν ≤ τ g1 ] ≥ 1 2 , so that τ g1 is also a median of w s . The fact that w s ≥ Y ν implies for any t > 0 that P[Y ν -τ g1 > t] ≤ P[w s -τ g1 > t], and additionally if Y ν ≤ τ g1 -s we have w s = τ g1 -s, so that P[Y ν -τ g1 ≤ -s] ≤ P[w s -τ g1 ≤ -s]. In particular, the tails of Y ν can be controlled in terms of those of w s for appropriate choices of s. Additionally, by Lemma E.21, we have that for each s, t → w s (g 2 + tḡ) is differentiable at all but countably many points of [0, 1], and has derivative there equal to t → ḡ, ∇ g2 w s (g 2 ) , where ∇ g2 w s (g 2 ) = 1 ws(g2)>τg 1 -s ∇ g2 Y ν (g 2 ), which, following from (E.9), satisfies a strengthened gradient norm estimate ∇ g2 w s (g 2 ) 2 ≤ 8ν1 ws(g2)>τg 1 -s 1 -ψ ( v g1,g2 2 )Y ν (g 2 ) 2 ≤ 8ν 1 -ψ ( v g1,g2 2 )(τ g1 -s) 2 . (E.11) In particular, we obtain a nearly-Lipschitz estimate of the form (E.10): |w s (g 2 ) -w s (g 2 )| ≤ g 2 -g 2 2 1 0 8ν 1 -ψ ( v g1,tg 2 +(1-t)g2) 2 )(τ g1 -s) 2 dt. (E.12) For each g 1 , we define a set S g1 = {g 2 | v g1,g2 2 ≥ 1 4 }. 
Noting that the function g 2 → σ(g 1 cos ν + g 2 sin ν) 2 is a convex 1-Lipschitz function (given that |sin ν| ≤ 1), we have by Gauss-Lipschitz concentration P g2 v g1,g2 2 ≤ E g2 [ v g1,g2 2 ] -t ≤ e -cnt 2 , and by Jensen's inequality E g2 [ v g1,g2 2 ] ≥ |cos ν| u g1 2 ≥ u g1 2 √ 2 , where the last line holds because ν ≤ π/4. By Lemma E.16, there is a g 1 event E having probability at least 1 -Ce -cn on which u g1 2 ≥ 1 2 , so that for any g 1 ∈ E, we have by a suitable choice of t in our Gauss-Lipschitz bound Pg 2 [S g1 ] ≥ 1-e -cn . Thus, using the first line of (E.11), the Gaussian Poincaré inequality and the Lipschitz property of w s (which follows from (E.12) after bounding by an absolute constant) and Rademacher's theorem on a.e. differentiability of Lipschitz functions, we have whenever g 1 ∈ E Var[w s ] ≤ 2 n E g2 ∇ g2 w s (g 2 ) 2 2 ≤ 128ν 2 n E g2 1 -ψ ( v g1,g2 2 )Y ν (g 2 ) 2 ≤ 128ν 2 n E g2 1 E 1 -ψ ( v g1,g2 2 )Y ν (g 2 ) 2 + E g2 1 E c 1 -ψ ( v g1,g2 2 )Y ν (g 2 ) 2 ≤ 256ν 2 n E g2 [1 -Y ν (g 2 )] + Ce -cn , (E.13) where we also make use of the fact that 0 ≤ Y ν ≤ 1. Now, we calculate for g 1 ∈ E and g 2 ∈ S g1 Y ν = 1 - 1 2 u g1 u g1 2 - v g1,g2 v g1,g2 2 2 2 ≥ 1 -2 u g1 -v g1,g2 2 2 u g1 2 2 ≥ 1 -8 u g1 -v g1,g2 2 2 ≥ 1 -8 g 1 -(g 1 cos ν + g 2 sin ν) 2 2 , (E.14) where the second inequality uses g 1 ∈ E, the third uses nonexpansiveness of σ, and the first requires a proof; we will show that for any nonzero vectors x, y ∈ R n , one has x x 2 - y y 2 2 ≤ 2 x -y 2 y 2 . (E.15) To see this, write θ for the angle between x and y, and distribute to obtain equivalently - 1 2 y 2 2 (1 + cos θ) ≤ x 2 2 -2 x 2 y 2 cos θ. Divide through by x 2 2 , write K = y 2 x -1 2 , and rearrange to obtain the equivalent expression K 2 (1 + cos θ)) -4K cos θ + 2 ≥ 0. It suffices to minimize the LHS of the previous inequality with respect to K subject to the constraint K > 0 and then study the resulting function of θ to ascertain the validity of the bound. 
Given that 1 + cos θ ≥ 0, the LHS is a convex function of K, with minimizer K = 2 cos θ(1 + cos θ) -1 , and therefore for any θ ≥ π/2, the LHS subject to the constraint K > 0 is minimized at K = 0, where the inequality is easily seen to be true. If θ < π/2, we have that the minimizer is positive, and we verify that after substituting the bound becomes 1 + cos θ ≥ 2 cos 2 θ, which is also seen to be true for θ < π/2, for example by showing that the polynomial x → -2x 2 + x + 1 is nonnegative on [0, 1]. This proves the inequality, so returning to (E.14), we have Y ν ≥ 1 -8((1 -cos ν) 2 g 1 2 2 + sin 2 ν g 2 2 2 -2(sin ν)(1 -cos ν) g 1 , g 2 ) ≥ 1 -8((1 -cos ν) 2 g 1 2 2 + sin 2 ν g 2 2 2 -2(sin ν)(1 -cos ν) g 1 2 g 2 2 ) using Cauchy-Schwarz in the second inequality. By Gauss-Lipschitz concentration (e.g. following the proof of the third assertion in Lemma E.17), there is a g 1 event E and a g 2 event E , each with probability at least 1 -Ce -cn , on which we have (respectively) g i 2 ≤ 2 for i = 1, 2. Then using (sin ν)(1 -cos ν) ≥ 0, we obtain that when g 1 ∈ E ∩ E and when g 2 ∈ S g1 ∩ E Y ν ≥ 1 -32((1 -cos ν) 2 + sin 2 ν) = 1 -64(1 -cos ν) ≥ 1 -32ν 2 , where the final inequality uses the standard estimate cos ν ≥ 1 -0.5ν 2 , which can be proved via Taylor expansion. By a union bound, we can assert that with g 1 -probability at least 1 -Ce -cn , with conditional (in g 2 ) probability at least 1 -C e -c n we have Y ν ≥ 1 -32ν 2 , so that in particular, by nonnegativity of Y ν , and choosing n larger than an absolute constant, we guarantee with g 1probability at least 1 - Ce -cn E g2 [Y ν ] ≥ 1 -32ν 2 -C e -cn , τ g1 ≥ 1 -32ν 2 . (E.16) Plugging the mean estimate into (E.13), we conclude with probability at least 1 -C e -c n Var[w s ] ≤ Cν 4 n + C e -cn . (E.17) We could have just as well applied this exact argument to Y ν instead of w s , so we conclude the claimed variance bound from this expression. 
We have stated the result in terms of the truncations w s so that it can be applied towards deviations control in the sequel. As an immediate application, we use the fact that any median is a minimizer of the quantity c → E[|X -c|] for any integrable X and c ∈ R to get with probability at least 1 -C e -c n E g2 [w s ] -τ g1 ≤ E g2 [|w s -τ g1 |] ≤ E g2 w s -E g2 [w s ] ≤ Var[w s ] ≤ Cν 2 √ n + C e -cn , (E.18) where we also applied Jensen's inequality for the first inequality and the Lyapunov inequality for the third. In particular, the same argument yields E g2 [Y ν ] -τ g1 ≤ Cν 2 √ n + C e -cn . (E.19) We turn to removing the t dependence in (E.12) without sacrificing the dependence on τ g1 . To obtain a Lipschitz estimate on the subset S g1 we need to control the norm of ∇ g2 w s on the line segment between g 2 , g 2 ∈ S g1 . For this, write σ y (x) = max{x -y, 0} for any y ∈ R, and make the following observations: 1. v g1,g2 = (sin ν)σ -g1 cot ν (g 2 ), so that Y ν (g 2 ) = u g1 ψ ( u g1 2 ) , (sin ν)σ -g1 cot ν (g 2 ) ψ ( (sin ν)σ -g1 cot ν (g 2 ) 2 ) ; 2. for any x, y, σ y (x) = max{x, y} -y; x → max{x, y} is the projection onto the convex set {x | x i ≥ y i ∀i}, so in particular x → σ y (x) is nonexpansive, has convex range, and satisfies σ y (σ y (x) + y) = σ y (x); and thus 3. for any g 2 , Y ν (g 2 ) = Y ν (σ -g1 cot ν (g 2 ) -g 1 cot ν). We write g2 = σ -g1 cot ν (g 2 ) -g 1 cot ν, g 2 = σ -g1 cot ν (g 2 ) -g 1 cot ν, so that (E.12) becomes |w s (g 2 ) -w s (g 2 )| = |w s (g 2 ) -w s (g 2 )| ≤ g 2 -g2 2 1 0 8ν 1 -ψ ( v g1,tg 2 +(1-t)g2) 2 )(τ g1 -s) 2 dt ≤ g 2 -g 2 2 1 0 8ν 1 -ψ ( v g1,tg 2 +(1-t)g2) 2 )(τ g1 -s) 2 dt, where the second line follows from nonexpansiveness and translation invariance of the distance. 
Having reduced to the study of points along the segment between g2 and g 2 , we now observe σ -g1 cot ν (tg 2 + (1 -t)g 2 ) = σ (tσ -g1 cot ν (g 2 ) + (1 -t)σ -g1 cot ν (g 2 )) = tσ -g1 cot ν (g 2 ) + (1 -t)σ -g1 cot ν (g 2 ), because σ -g1 cot ν has image included in the nonnegative orthant, which is convex. It then follows from (1) above that v g1,tg 2 +(1-t)g2 2 = (sin ν) tσ -g1 cot ν (g 2 ) + (1 -t)σ -g1 cot ν (g 2 ) 2 = tv g1,g2 + (1 -t)v g1,g 2 2 , and in particular tv g1,g2 + (1 -t)v g1,g 2 2 2 = t 2 v g1,g2 2 2 + 2t(1 -t) v g1,g2 , v g1,g 2 + (1 -t) 2 v g1,g 2 2 2 ≥ 1 16 t 2 + (1 -t) 2 ≥ 1 32 , where the first inequality uses that σ ≥ 0 and g 2 , g 2 ∈ S g1 , and the second minimizes the convex function of t in the previous bound. We conclude that g 2 , g 2 ∈ S g1 implies that v g1,tg 2 +(1-t)g2 2 > 1 8 for every t ∈ [0, 1], and consequently (E.12) becomes (after an additional simplification of the quantity under the square root using τ g1 ≤ 1) |w s (g 2 ) -w s (g 2 )| ≤ 16ν 1 -(τ g1 -s) g 2 -g 2 2 , (E.20) so that w s is 16ν 1 -(τ g1 -s)-Lipschitz on S g1 . Then by an application of the median bound in (E.16), if 0 ≤ s < 1 -32ν 2 , with g 1 probability at least 1 -Ce -cn we have that w s is 16ν √ 32ν 2 + s-Lipschitz on S g1 . For the previous assertion to be nonvacuous, we need to take ν small; in particular, we have 1 -32ν 2 > 1 2 if ν < 1/8, which we can take to be the value of the absolute constant c 0 we left unspecified previously. Then for each such s, define L s = 16ν √ 32ν 2 + s, and define ŵs (g 2 ) = sup g 2 ∈Sg 1 {w s (g 2 ) -L s g 2 -g 2 2 }. Then ŵs is L s -Lipschitz on R n , and satisfies ŵs = w s on S g1 (Evans & Gariepy, 1991, §3.1.1 Theorem 1). 
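The extension ŵs is an instance of the classical McShane sup-convolution construction cited from Evans & Gariepy. A one-dimensional sketch (ours; the finite set S, the data, and the constant L below are hypothetical stand-ins for S_{g1}, w_s, and L_s) illustrates the two properties used: the extension agrees with the original function on S and is L-Lipschitz on all of R.

```python
import numpy as np

# Illustrative sketch (ours) of the McShane extension
#   f_hat(x) = sup_{x' in S} { f(x') - L * |x - x'| }
# of an L-Lipschitz function f given on a finite set S.
def mcshane_extension(xs, fs, L):
    """Return the McShane extension of the data (xs, fs) with constant L."""
    def f_hat(x):
        return float(np.max(fs - L * np.abs(x - xs)))
    return f_hat

L = 2.0
xs = np.array([-1.0, 0.0, 0.5, 2.0])
fs = np.array([1.0, 0.0, 0.8, -1.0])  # 2-Lipschitz on xs (checked below)
assert all(abs(fs[i] - fs[j]) <= L * abs(xs[i] - xs[j])
           for i in range(len(xs)) for j in range(len(xs)))

f_hat = mcshane_extension(xs, fs, L)

# Agrees with f on S ...
for x, v in zip(xs, fs):
    assert abs(f_hat(x) - v) < 1e-12

# ... and is L-Lipschitz everywhere (checked on a grid).
grid = np.linspace(-3.0, 3.0, 601)
vals = np.array([f_hat(x) for x in grid])
assert np.max(np.abs(np.diff(vals))) <= L * (grid[1] - grid[0]) + 1e-9
```

Because the extension is a supremum of L-Lipschitz functions of x, the Lipschitz property is immediate; agreement on S uses the Lipschitz property of the original data, exactly as in the argument above.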
By the Gaussian Poincaré inequality, we obtain immediately Var[ ŵs ] ≤ 2L s 2 /n, and using ŵs = w s on S g1 , we compute E g2 [w s -ŵs ] = E g2∈S c g 1 [w s -ŵs ] ≤ E g2 1 S c g 1 |w s -ŵs | ≤ P g2 S c g1 1/2 w s -ŵs L 2 ≤ Ce -cn ( w s L 2 + ŵs L 2 ) ≤ C e -cn , (E.21) where the second inequality follows from the Schwarz inequality, the third holds given that g 1 ∈ E and by the Minkowski inequality, and the final uses that w s and ŵs are both Lipschitz with Lipschitz constants bounded above by absolute constants together with the Gaussian Poincaré inequality. Meanwhile, by Gauss-Lipschitz concentration, we obtain a Bernstein-type lower tail P ŵs ≤ E g2 [ ŵs ] -s ≤ exp - cns 2 ν 2 (32ν 2 + s) , (E.22) and for the upper tail, it will be sufficient to consider ŵ0 , which satisfies a subgaussian upper tail (for any t ≥ 0) P ŵ0 ≥ E g2 [ ŵ0 ] + t ≤ exp - c nt 2 ν 4 . (E.23) Using the results (E.18), (E.19), (E.21), and the fact that w s = ŵs on S g1 , we get P g2 Y ν -E g2 [Y ν ] ≤ -s ≤ P g2 ŵs -E g2 [ ŵs ] ≤ C ν 2 √ n + C e -cn -s + P g2 S c g1 . (E.24) Using d ≥ 1, we put s = 2Cν 2 d/n + C e -cn in this bound; using that ν < 1/8, and in particular 1 -32ν 2 > 1 2 , we can choose n larger than an absolute constant multiple of d to guarantee that for all 0 ≤ ν < 1/8, this choice of s is less than 1 -32ν 2 , and that Cν 2 d/n ≤ 32ν 2 . Together with the lower tail bound (E.22), these facts imply P g2 Y ν -E g2 [Y ν ] ≤ -2Cν 2 d n -C e -cn ≤ P g2 ŵs -E g2 [ ŵs ] ≤ -Cν 2 d n + C e -c n ≤ e -c d + C e -c n . Meanwhile, for the upper tail, we have for any t ≥ 0 P g2 Y ν -E g2 [Y ν ] ≥ t ≤ P g2 ŵ0 -E g2 [ ŵ0 ] ≥ t -C ν 2 √ n -C e -cn + P g2 S c g1 , and if we put t = 2Cν 2 d/n + C e -cn , our previous requirements on n and the upper tail bound (E.23) yield P g2 Y ν -E g2 [Y ν ] ≥ 2Cν 2 d n + C e -cn ≤ e -c d + C e -c n . Combining these two bounds gives control of absolute deviations about the mean.
By independence, we conclude P g1,g2 Y ν -E g2 [Y ν ] ≤ 2Cν 2 d n + C e -c n ≥ (1 -2e -cd -Ce -c n )(1 -C e -c n ) ≥ 1 -2e -cd -Ce -c n -C e -c n . To conclude, we have shown that for every ν ∈ [0, π] one has with probability at least 1 -Ce -cn Var[X ν (g 1 , • )] ≤ C ν 4 n + C ne -c n , and with (g 1 , g 2 ) probability at least 1 -2e -c d -C ne -c n one has X ν -E g2 [X ν ] ≤ C ν 2 d n + C ne -c n . To simplify these bounds, we may in addition choose n larger than an absolute constant multiple of log n, and n larger than an absolute constant multiple of d, to obtain that with probability at least 1 -Ce -cn Var[X ν (g 1 , • )] ≤ C 4 ν 4 n + C e -c n , and with (g 1 , g 2 ) probability at least 1 -C e -c d one has X ν -E g2 [X ν ] ≤ C 5 ν 2 d n + C e -c n , which was to be shown. Lemma E.12. There exist absolute constants c, C, C > 0 and an absolute constant K > 0 such that if n ≥ K log 4 n, then for every ν ∈ [0, π] one has Var E g2 [X ν ( • , g 2 )] ≤ Cν 4 log n n + C e -cn . Proof. Define Y ν (g 1 , g 2 ) = v 0 (g 1 , g 2 ), v ν (g 1 , g 2 ) ψ( v 0 (g 1 , g 2 ) 2 )ψ( v ν (g 1 , g 2 ) 2 ) , where ψ = ψ 0.25 is as in Lemma E.31. Then by Cauchy-Schwarz and property 2 in Lemma E.31 (the case where either v 0 2 = 0 or v ν 2 = 0 is treated separately, since in this case Y ν = 0), we obtain |Y ν | ≤ 4, and E g1 E g2 [X ν ] -E g2 [Y ν ] 2 = E g1 E g2 1 E c v 0 , v ν ψ( v 0 2 )ψ( v ν 2 ) 2 ≤ E g1,g2 1 E c v 0 , v ν ψ( v 0 2 )ψ( v ν 2 ) 2 ≤ 16µ(E c m ) ≤ Cne -cn , where we use the fact that if (g 1 , g 2 ) ∈ E m then v ν 2 ≥ 1 2 for every 0 ≤ ν ≤ π and hence ψ( v ν 2 ) = v ν 2 in the first line, apply Jensen's inequality in the second line, and combine our bound on Y ν with Hölder's inequality and the measure bound in Lemma E.16 in the third line.
An application of Lemma E.32 then yields Var E g2 [X ν ( • , g 2 )] ≤ Var E g2 [Y ν ( • , g 2 )] + Cne -cn ≤ Var E g2 [Y ν ( • , g 2 )] + Ce -cn/2 , where the last inequality holds when n is chosen to be larger than an absolute constant multiple of log n. It thus suffices to control the variance of Y ν . Applying Lemma E.26, we get for almost all g 1 ∈ R n E g2 [Y ν (g 1 , g 2 )] = v 0 2 2 ψ( v 0 2 ) 2 + ν 0 t 0 E g2 [(Ξ 1 + Ξ 2 + Ξ 3 + Ξ 4 + Ξ 5 + Ξ 6 )(s, g 1 , g 2 )] ds dt, where we follow the notation defined in Lemma E.13. We start by removing the term outside of the integral from consideration. We have as above |Y ν | ≤ 4, so that |E g2 [Y ν ]| ≤ 4. Moreover, following the proof of the measure bound in Lemma E.16, but using only the pointwise concentration result, we assert that if n ≥ C an absolute constant there is an event E on which 0.5 ≤ v 0 2 ≤ 2 with probability at least 1 -2e -cn with c > 0 an absolute constant. This implies that if g 1 ∈ E we have v 0 2 2 ψ( v 0 2 ) 2 = 1, and since v 0 2 2 ψ( v 0 2 ) 2 ≤ 4, by the same argument used for Y ν , we can calculate v 0 2 2 ψ( v 0 2 ) 2 -1 L 2 ≤ v 0 2 2 ψ( v 0 2 ) 2 -1 1 E c L 2 ≤ 5 1 E c L 2 ≤ Ce -cn , by the Minkowski inequality and the triangle inequality. An application of Lemma E.32 implies that it is therefore sufficient to control the variance of the quantity f (ν, g 1 ) = 1 + ν 0 t 0 E g2 [(Ξ 1 + Ξ 2 + Ξ 3 + Ξ 4 + Ξ 5 + Ξ 6 )(s, g 1 , g 2 )] ds dt. By Lemma E.37, the Lyapunov inequality, and Fubini's theorem, we have f (ν, g 1 ) -E[f (ν, g 1 )] = ν 0 t 0 6 i=1 E g2 [Ξ i (s, g 1 , g 2 )] -E g1,g2 [Ξ i (s, g 1 , g 2 )] ds dt Using the elementary inequality ν 0 t 0 g(s) ds dt 2 ≤ ν ν 0 t t 0 g 2 (s) ds dt, valid for any square integrable g : [0, π] → R and proved with two applications of Jensen's inequality, and Lemma E.37, we obtain (f (ν, g 1 ) -E[f (ν, g 1 )]) 2 ≤ ν ν 0 t t 0 6 i=1 E g2 [Ξ i (s, g 1 , g 2 )] -E g1,g2 [Ξ i (s, g 1 , g 2 )] 2 ds dt.
Thus, again by Lemma E.37, the Lyapunov inequality, Fubini's theorem, and compactness of [0, π], we have Var[f (ν, • )] ≤ ν ν 0 t t 0 Var 6 i=1 E g2 [Ξ i (s, g 1 , g 2 )] ds dt. (E.26) We can control the variance under the integral using a combination of Lemmas E.35 and E.37, together with the deviations control given by Lemmas E.39, E.41 to E.44 and E.46, since we have chosen n according to the hypotheses of Lemma E.13. In particular, these lemmas furnish deviation bounds of size at most C i d log n n + n -cid + C i ne -c i n that hold with probabilities at least 1-C i n -c i d -C i ne -c i n , for any d ≥ 1 larger than an absolute constant and suitable absolute constants specified above. We can simplify these bounds as follows: first, choose n such that n ≥ (2/c i ) log n for each i, which guarantees that the bounds hold with probability at least 1 -C i n -c i d -C i e -c i n/2 . Next, choose n ≥ (2c i /c i )d log n for all i, which implies that the bounds hold with probability at least 1 -2 max{C i , C i }n -c i d . Similarly, we also choose n such that n ≥ (2/c i ) log n for each i, which guarantees that the error terms that are exponential in n in the bounds are upper bounded by C i e -c i n/2 , and, choose n ≥ (2c i /c i )d log n for all i, which implies that for all i C i d log n n + n -cid + C i ne -c i n ≤ C i d log n n + 2 max{C i , C i }n -cid . Finally, we make the particular choice d = 4/ min i {c i , c i }, or the minimum required value of d, whichever is larger, so that there are absolute constants C, C , C > 0 such that with probability at least 1 -C n -4 we have for all i E g2 [Ξ i (ν, g 1 , g 2 )] -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C log n n + C n -4 ≤ 2C log n n , where the last inequality holds when n is larger than an absolute constant. 
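The elementary double-integral inequality used to arrive at (E.26), namely (∫₀^ν ∫₀^t g(s) ds dt)² ≤ ν ∫₀^ν t ∫₀^t g²(s) ds dt, can be spot-checked numerically. This is an illustrative sketch of ours, using crude Riemann sums.

```python
import numpy as np

# Numerical spot-check (ours) of the elementary inequality
#   ( int_0^v int_0^t g(s) ds dt )^2  <=  v * int_0^v t * int_0^t g(s)^2 ds dt,
# proved with two applications of Jensen's inequality.
def both_sides(g, v, m=2000):
    s = np.linspace(0.0, v, m)
    ds = s[1] - s[0]
    inner = np.cumsum(g(s)) * ds          # t |-> int_0^t g(s) ds
    inner_sq = np.cumsum(g(s) ** 2) * ds  # t |-> int_0^t g(s)^2 ds
    lhs = (np.sum(inner) * ds) ** 2
    rhs = v * np.sum(s * inner_sq) * ds
    return lhs, rhs

for g in (np.sin, np.cos, lambda s: s ** 2 - 1.0):
    lhs, rhs = both_sides(g, np.pi)
    assert lhs <= rhs * (1 + 1e-2)  # small tolerance for the discretization
```

For g = cos on [0, π], for instance, the left side is 4 while the right side is roughly 15, so the bound holds with a comfortable margin.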
With these bounds, we can now invoke Lemma E.35 with Lemma E.37 to get Var 6 i=1 E g2 [Ξ i (s, g 1 , g 2 )] ≤ C log n n + C n 2 ≤ C log n n , for different absolute constants C, C , C > 0, and where the last inequality again holds n is larger than an absolute constant. Plugging back into (E.26) and evaluating the integrals, we get Var[f (ν, • )] ≤ Cν 4 log n n , which is enough to conclude. Lemma E.13. Write Y ν (g 1 , g 2 ) = v 0 (g 1 , g 2 ), v ν (g 1 , g 2 ) ψ( v 0 (g 1 , g 2 ) 2 )ψ( v ν (g 1 , g 2 ) 2 ) , where ψ = ψ 0.25 is as in Lemma E.31. There exist absolute constants c, c , C, C , C > 0 and absolute constants K, K > 0 such that for any d ≥ 1, if n ≥ Kd 4 log 4 n and if d ≥ K , then there is an event E such that 1. One has ∀ν ∈ [0, π], E g2 [Y ν ] -E g1,g2 [Y ν ] ≤ C ν 2 d log n n + Ce -cn if g 1 ∈ E; 2. One has P[E] ≥ 1 -C n -c d . Proof. Fix d > 0, and write f (ν, g 1 ) = E g2 [Y ν (g 1 , g 2 )]. Applying Lemma E.26, we get for almost all g 1 ∈ R n f (ν, g 1 ) = v 0 2 2 ψ( v 0 2 ) 2 + ν 0 t 0 E g2 [(Ξ 1 + Ξ 2 + Ξ 3 + Ξ 4 + Ξ 5 + Ξ 6 )(s, g 1 , g 2 )] ds dt, (E.27) where Ξ 1 (s, g 1 , g 2 ) = n i=1 σ(g 1i ) 3 ρ(-g 1i cot s) ψ( v 0 2 )ψ( v i s 2 ) sin 3 s Ξ 2 (s, g 1 , g 2 ) = v 0 , v s ψ ( v s 2 ) v s 2 ψ( v 0 2 )ψ( v s 2 ) 2 - v 0 , v s ψ( v 0 2 )ψ( v s 2 ) Ξ 3 (s, g 1 , g 2 ) = - v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 2 Ξ 4 (s, g 1 , g 2 ) = -2 v 0 , vs v s , vs ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 Ξ 5 (s, g 1 , g 2 ) = - v 0 , v s vs 2 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 Ξ 6 (s, g 1 , g 2 ) = 2 v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 3 v s 2 2 + v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 3 2 . Here we put Ξ 1 (0, g 1 , g 2 ) = Ξ 1 (π, g 1 , g 2 ) = 0, which does not affect the integral and which is equal to the limits lim ν 0 Ξ 1 (ν, g 1 , g 2 ) = lim ν π Ξ 1 (ν, g 1 , g 2 ) for every (g 1 , g 2 ). 
Momentarily ignoring measurability issues, it is of interest to construct g 1 events E i of suitable probability on which we have sup ν∈[0,π] E g2 [Ξ i (ν, g 1 , g 2 )] -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n (E.28) for each i = 1, . . . , 6, and a g 1 event E 7 on which we have v 0 2 2 ψ( v 0 2 ) 2 -E g1 v 0 2 2 ψ( v 0 2 ) 2 ≤ C 7 e -c 7 n . We can then consider the event E = 7 i=1 E i , possibly minus a negligible set on which (E.27) fails to hold, which has high probability via a union bound and on which we have simultaneously for all ν ∈ [0, π] f (ν, g 1 ) -E g1 [f (ν, g 1 )] ≤ v 0 2 2 ψ( v 0 2 ) 2 -E g1 v 0 2 2 ψ( v 0 2 ) 2 + 6 i=1 ν 0 t 0 E g2 [Ξ i (s, g 1 , g 2 )] -E g1,g2 [Ξ i (s, g 1 , g 2 )] ds dt ≤ Cν 2 d log n n + n -cd + C ne -c n , by Fubini's theorem and Lemma E.37, the triangle inequality (for | • | and for the integral), (E.28), and using ν 2 ≤ π 2 and worst-casing the remaining constants. To establish the bounds (E.28), we will employ Lemma E.48, which shows that it is sufficient to obtain pointwise control and show a suitable s-Lipschitz property for each i ∈ [6]; following the lemma, these properties also imply Lebesgue measurability of the suprema immediately. Reduction to product space events. Fix ν. By the triangle inequality, we have for each i = 1, . . . , 6 E g2 [Ξ i (ν, g 1 , g 2 )] -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ E g2 Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] . (E.29) Suppose we can construct (g 1 , g 2 ) events E i such that 1. If (g 1 , g 2 ) ∈ E i , then Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n ; 2. One has P[E i ] ≥ 1 -C i n -c i d -C i ne -c i n .
Then for each such i, we can write E g2 Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] = E g2 1 E i + 1 (E i ) c Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n + E g2 1 (E i ) c Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] (E.30) using nonnegativity of the integrand and boundedness of the indicator for E i in the second line. The random variable remaining in the second line is nonnegative, and by Fubini's theorem (with Lemma E.37 for joint integrability) and the Schwarz inequality we have E g1 E g2 1 (E i ) c Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ E g1,g2 1 (E i ) c 1/2 E g1,g2 Ξ i (ν, g 1 , g 2 ) -Eg 1,g2 [Ξ i (ν, g 1 , g 2 )] 2 1/2 ≤ C C i n -c i d + C i ne -c i n 1/2 , where the second line applies Lemma E.37 and the Lyapunov inequality. We can replace this last inequality with one equivalent to the measure bound on (E i ) c using subadditivity of the square root and reducing the constants c i and c i by a factor of 2. Using this last inequality, Markov's inequality implies for any t ≥ 0 P E g2 1 (E i ) c Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≥ Cn -1 2 c i d + C n 1/2 e -1 2 c i n ≤ Cn -1 2 c i d + C n 1/2 e -1 2 c i n , which, together with (E.29) and after worst-casing some exponents and constants, implies that there is a g 1 event E i that satisfies (the constants C and C are scoped across properties 1 and 2) 1. If g 1 ∈ E i , then E g2 [Ξ i (ν, g 1 , g 2 )] -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + (C i + C)n -1 2 cid + (C i + C )ne -1 2 min{c i ,c i }n ; 2. One has P[E i ] ≥ 1 -Cn -1 2 c i d -C ne -1 2 c i n . Thus, we can pass from (g 1 , g 2 ) events to g 1 events with only a worsening of constants, and it suffices to construct the events E i . Additionally, we can leverage this same framework to pass ν-uniform control from the product space to g 1 -space. Suppose we can construct (g 1 , g 2 ) events E i such that 1. 
If (g 1 , g 2 ) ∈ E i , then ∀ν ∈ [0, π], Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n ; 2. One has P[E i ] ≥ 1 -C i n -c i d -C i ne -c i n . Then following (E.30), we can assert ∀ν ∈ [0, π], E g2 Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n + E g2 1 (E i ) c Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] . To get uniform control of this last random variable, we can use Lemma E.37, which tells us that we have a bound ∀ν ∈ [0, π], E g2 Ξ i (ν, g 1 , g 2 ) -E g1,g2 [Ξ i (ν, g 1 , g 2 )] ≤ C i d log n n + n -cid + C i ne -c i n + E g2 1 (E i ) c f i (g 1 , g 2 ) , (E.31) where f i is in L 4 (R n × R n ), and has L 4 norm bounded by an absolute constant C i > 0. Then Fubini's theorem and the Schwarz inequality allow us to assert E g1,g2 1 (E i ) c f i (g 1 , g 2 ) ≤ C i E g1,g2 1 (E i ) c 1/2 , which can be controlled exactly as in the pointwise control argument. In particular, an application of Markov's inequality gives P E g2 1 (E i ) c f i (g 1 , g 2 ) ≥ Cn -1 2 c i d + C n 1/2 e -1 2 c i n ≤ Cn -1 2 c i d + C n 1/2 e -1 2 c i n , so that, returning to (E.31), we have uniform control of the quantity |E g2 [Ξ i (ν, g 1 , g 2 )] - Eg 1,g2 [Ξ i (ν, g 1 , g 2 )] | on an event of appropriately high probability. In particular, we have incurred only losses in the constants compared to the pointwise case. Approach to Lipschitz estimates. We will use this framework for controlling the Ξ 1 and Ξ 5 terms only. Accordingly, the sections for those terms below will produce results of the following type, for absolute constants c i , c i , c i , c i , C i , C i , C i , C i , C i > 0 for i = 1, 2, and parameters d ≥ 1, δ > 0 such that d and δ are larger than (separate) absolute constants and n satisfies certain conditions involving d: 1. 
For each ν ∈ [0, π] fixed, with probability at least 1 -C 1 n -c 1 d -C 1 ne -c 1 n , we have that |E g2 [Ξ i (ν, g 1 , g 2 ) -E[Ξ i (ν, g 1 , g 2 )]]| ≤ C 1 d log n/n + C 1 n -c1d + C 1 ne -c 1 n ; 2. With probability at least 1 -C 2 e -c 2 n -C 2 n -δ , we have that |E g2 [Ξ i (ν, g 1 , g 2 ) - E[Ξ i (ν, g 1 , g 2 )]]| is (C 2 + C 2 n 1+δ )-Lipschitz. We show here that we can use these properties to obtain uniform concentration of the relevant quantities. Write M = C 1 d log n/n + C 1 n -c1d + C 1 ne -c 1 n ; we are interested in showing that uniform bounds of sizes close to M hold with probability not much smaller than that of the pointwise bounds. By Lemma E.48, it follows from the assumed properties that for any 0 < ε < 1 one has P sup ν∈[0,π] E g2 [Ξ i (ν, g 1 , g 2 ) -E[Ξ i (ν, g 1 , g 2 )]] ≤ M + ε C 2 + C 2 n 1+δ ≥ 1 -(C 1 n -c 1 d + C 1 ne -c 1 n )Kε -1 -C 2 e -c 2 n -C 2 n -δ , where K > 0 is an absolute constant. To make the RHS of the bound on the supremum of size comparable to M , it suffices to choose ε = C 1 d log n/n/(C 2 + C 2 n 1+δ ). We have C 2 + C 2 n 1+δ ≤ K n 1+δ for K > 0 an absolute constant, and so we have ε -1 ≤ K n 3/2+δ for K > 0 another absolute constant. This gives C 1 n -c 1 d + C 1 ne -c 1 n ε -1 ≤ K n 3/2+δ C 1 e -c 1 d log n + C 1 e -c 1 n/2 ≤ K n 3/2+δ e -c 1 d log n ≤ K n -c 1 d/2 , where K > 0 is an absolute constant whose value changes from line to line; and where the first inequality assumes that n ≥ (2/c 1 ) log n, the second inequality assumes that n ≥ (2c 1 /c 1 )d log n, and the third assumes that δ ≤ c 1 d/2 -3/2. Choosing d so that the value c 1 d/2 -3/2 is larger than the minimum value for δ (i.e., larger than an absolute constant), then choosing δ = c 1 d/2 -3/2, and finally choosing d ≥ 6/c 1 , we obtain P sup ν∈[0,π] E g2 [Ξ i (ν, g 1 , g 2 ) -E[Ξ i (ν, g 1 , g 2 )]] ≤ 2M ≥ 1 -Kn -c 1 d/2 -C 2 e -c 2 n -C 2 n -c 1 d/4 , where K > 0 is an absolute constant, which is an acceptable level of uniformization.
Completing the proof. To obtain the desired control, we apply the uniform framework for the terms Ξ i , i = 2, 3, 4, 6; and the pointwise with Lipschitz control framework for the terms Ξ i , i = 1, 5. We also establish high probability control of the zero-order term in Lemma E.38. The events we need for the pointwise framework terms are constructed in Lemmas E.39, E.40, E.44 and E.45. The events we need for the uniform framework are constructed in Lemmas E.41 to E.43 and E.46. Because n and d are chosen appropriately by our hypotheses here, we can invoke each of these lemmas to construct the necessary sub-events and obtain an event E which satisfies 1. One has ∀ν ∈ [0, π], E g2 [Y ν ] -E g1,g2 [Y ν ] ≤ Cν 2 d log n n + n -cd + C ne -c n if g 1 ∈ E; 2. One has P[E] ≥ 1 -C n -c d -C ne -c n . We can adjust d and n slightly to obtain an event with the properties claimed in the statement of the lemma. Indeed, choosing n to be larger than an absolute constant multiple of log n, we can obtain C ne -c n ≤ C e -c n/2 and C ne -c n ≤ C e -c n/2 ; choosing n to be larger than an absolute constant multiple of d log n, we can obtain C n -c d + C e -c n/2 ≤ 2C n -c d ; and choosing d to be larger than an absolute constant, we can assert d log n/n + n -cd ≤ 2 d log n/n. This turns the guarantees of E into the guarantees claimed in the statement of the lemma, and completes the proof.

E.3.3 PROVING LEMMA E.7

Lemma E.14. One has bounds 1 - ν 2 2 ≤ cos ϕ(ν) ≤ 1 -cν 2 , ν ∈ [0, π]. Proof. Write f (ν) = cos ϕ(ν) = cos ν + π -1 (sin ν -ν cos ν) , where the last equality follows from Lemma E.2. We start by obtaining quadratic bounds on f (ν) for ν ∈ [0, 0.1]. In particular, we will show 1 -1 2 ν 2 ≤ f (ν) ≤ 1 -1 4 ν 2 , ν ∈ [0, 0.1]. (E.32) We readily calculate f (ν) = -sin ν + π -1 ν sin ν, f (ν) = -cos ν + π -1 (ν cos ν + sin ν). Taylor expanding at ν = 0 gives 1 + inf t∈[0,0.1] f (t) 2 ν 2 ≤ f (ν) ≤ 1 + sup t∈[0,0.1] f (t) 2 ν 2 . We have f (0) = -1, and sin ν ≤ sin 0.1 on our interval of interest by monotonicity. The derivative of ν cos ν is cos ν -ν sin ν; ν sin ν is increasing as the product of two increasing functions (given ν ≤ 0.1), and one checks that cos(0.1) -0.1 sin(0.1) > 0; therefore ν cos ν ≤ 0.1 cos(0.1) on our domain of interest. One checks numerically -cos(0.1) + π -1 0.1 cos(0.1) + sin(0.1) < -1 2 < 0, and this establishes f (ν) ≤ 1 -1 4 ν 2 on [0, 0.1]. If ν ≤ π/2 , we have cos ≥ 0 and sin ≥ 0, so that ν cos ν + sin ν ≥ 0 on this domain. This implies f (ν) ≥ -cos ν ≥ -1 for 0 ≤ ν ≤ π/2, which proves inf t∈[0,π/2] f (t) = -1, and establishes the lower bound on [0, π/2]. To obtain (possibly) looser bounds on [0, π], we use a bootstrapping approach. The lower bound is more straightforward; to assert the lower bound on [0, π], we evaluate constants numerically to find that the lower bound's value at π/2 is 1 -π 2 /8 < 0, and given that f ≥ 0 by Lemma E.5 and the concave quadratic bound is maximized at ν = 0, it follows that the bound holds on the entire interval. For bootstrapping the upper bound, we note that the equation f (ν) = -sin ν + π -1 ν sin ν = sin ν ν π -1 shows immediately that f is a strictly decreasing function of ν on (0, π). 
Therefore f (ν) ≤ f (0.1) on [0.1, π], and so the quadratic function ν → 1 -π -2 (1 -f (0.1))ν 2 , which is lower bounded by 1 -ν 2 /4 on [0, π] by the fact that both concave quadratic functions are maximized at 0 and the verification 1 -π 2 /4 < 0 ≤ f (0.1), is an upper bound for f on all of [0, π]; so the claim holds with c = π -2 (1 -f (0.1)). Lemma E.15. There exist absolute constants c, C, C , C > 0 such that if n ≥ C log n, then one has E g1,g2 [X ν ] -cos ϕ(ν) ≤ C e -cn + C ν 2 /n. Proof. Write h(ν) = cos ϕ(ν) -E[X ν ]. By Lemmas E.24 and E.25, we have a second-order Taylor formula h(ν) = h(0) + ν 0 h (0) + t 0 h (s) ds dt. We calculate h (0) = 0, since E[ v 0 , v0 ] = E[ σ(g 1 ), g 2 ] = 0, and P ⊥ v0 v 0 = 0. We also have h(0) = E[ v 0 2 2 ] -E[1 Em ] = µ(E c m ) (writing m = 1), so this formula yields |h(ν)| ≤ µ(E c m ) + ν 2 2 ess sup ν ∈[0,π] |h (ν )|, and we see that it suffices to bound h . We will use the (Lebesgue-a.e.) expression |h (ν)| = E[ vν , v0 ] -E 1 Em 1 v ν 2 I - v ν v * ν v ν 2 2 vν , 1 v 0 2 I - v 0 v * 0 v 0 2 2 v0 . Distributing over the inner product and applying rotational invariance to combine the two cross terms, then using the triangle inequality, we obtain the bound |h (ν)| ≤ E[ vν , v0 ] -E 1 Em v0 , vν v 0 2 v ν 2 Ξ1(ν) + 2E 1 Em v0 , v ν v ν , vν v 0 2 v ν 3 2 Ξ2(ν) + E 1 Em v0 , v 0 v 0 , v ν v ν , vν v 0 3 2 v ν 3 2 Ξ3(ν) . We proceed by giving magnitude bounds for Ξ i (ν), i = 1, 2, 3. Because we are working with expectations, it suffices to fix one value ν ∈ [0, π] and prove pointwise ν-independent bounds; we will exploit this in the sequel to easily define extra good events without having to uniformize, and we will generally suppress the notational dependence of Ξ i on ν as a result. We will also repeatedly use the fact that we have µ(E c 1 ) ≤ Cne -cn for some absolute constants c, C > 0 by Lemma E.16. 
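The bounds of Lemma E.14, including the explicit constant c = π⁻²(1 − f(0.1)) produced by the bootstrapping step, can be verified numerically on a fine grid. The following sketch is ours and is not part of the proof.

```python
import numpy as np

# Grid check (ours, not part of the proof) of Lemma E.14: with
# f(v) = cos(phi(v)) = cos v + (sin v - v cos v)/pi and the explicit constant
# c = (1 - f(0.1))/pi^2 from the bootstrapping argument, verify
#   1 - v^2/2 <= f(v) <= 1 - c v^2  for all v in [0, pi].
def f(v):
    return np.cos(v) + (np.sin(v) - v * np.cos(v)) / np.pi

c = (1.0 - f(0.1)) / np.pi ** 2
v = np.linspace(0.0, np.pi, 100000)
assert np.all(1 - 0.5 * v ** 2 <= f(v) + 1e-12)
assert np.all(f(v) <= 1 - c * v ** 2 + 1e-12)
```

The resulting constant c is small (on the order of 5 × 10⁻⁴), consistent with the fact that f(π) = 0 while the upper bound at ν = π equals f(0.1), which is close to 1.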
We will accrue a large number of additive C/n and C n pm e -cn errors as we bound the Ξ i terms; at the end of the proof we will worst-case the constants in each additive error and assert a bound of the form claimed. Ξ 1 control. Let E = { vν 2 ≤ 2} ∩ { v0 2 ≤ 2}. By Lemma E.17 and a union bound, we have µ(E c ) ≤ Ce -cn . Define an event E 1 = E m ∩ E. The first step is to pass to the control of Ξ 1 := E 1 E1 vν , v0 1 - 1 v 0 2 v ν 2 . The triangle inequality gives Ξ 1 -Ξ 1 ≤ E 1 E c 1 vν , v0 + E 1 Em\E vν v ν 2 , v0 v 0 2 The first term is readily controlled from two applications of the Schwarz inequality, a union bound, and rotational invariance together with Lemma E.29: E 1 E c 1 vν , v0 ≤ E 1 E c 1 1/2 E vν 4 2 1/4 E v0 4 2 1/4 ≤ µ(E c m ) + Ce -cn 1/2 E v0 4 2 1/2 ≤ Cne -cn + C e -c n 1/2 1 + C n 1/2 ≤ Cn 1/2 e -cn , where in the last line we require n to be at least the value of a large absolute constant. The calculation is similar for the normalized term, except we also apply the definition of E m to get some extra cancellation: E 1 Em\E vν , v0 v ν 2 v 0 2 ≤ E 1 Em\E | vν , v0 | v ν 2 v 0 2 ≤ 4E 1 Em\E | vν , v0 | ≤ 4E[1 E c | vν , v0 |] ≤ 4E[1 E c ] 1/2 E vν 4 2 1/4 E v0 4 2 1/4 ≤ Ce -cn 1 + C n 1/2 ≤ Ce -cn , where in the last line we apply our bounds from the first term and use n ≥ 1 to obtain the final inequality. Next, Taylor expansion of the smooth convex function x → x -1/2 on the domain x > 0 about the point x = 1 gives x -1/2 = 1 - 1 2 (x -1) + 3 4 x 1 (x -t)t -5/2 dt. (E.33) Given that E m guarantees v ν 2 ≥ 1 2 , we can apply this to get a bound 1 E1 1 - 1 v 0 2 v ν 2 = 1 E1 1 2 v 0 2 2 v ν 2 2 -1 - 3 4 v0 2 2 vν 2 2 1 ( v 0 2 2 v ν 2 2 -t)t -5/2 dt . 
On E , we also have v 0 -2 2 v ν -2 2 ≤ 2 4 , so we can control the integral residual as 0 ≤ 1 E1 3 4 v0 2 2 vν 2 2 1 ( v 0 2 2 v ν 2 2 -t)t -5/2 dt ≤ 1 E1 384 v 0 2 2 v ν 2 2 -1 2 , where we replace the tighter bound that we get in the case v 0 2 2 v ν 2 2 ≥ 1 with the worst-case bound from the other case. This gives bounds 1 E1 1 2 v 0 2 2 v ν 2 2 -1 -384 v 0 2 2 v ν 2 2 -1 2 ≤ 1 E1 1 - 1 v 0 2 v ν 2 ≤ 1 E1 1 2 v 0 2 2 v ν 2 2 -1 . Given that vν 2 ≤ 2 on E , it follows | v0 , vν | ≤ 4 on E 1 , so that v0 , vν + 4 ≥ 0 here. Writing 1 E1 v0 , vν 1 - 1 v 0 2 v ν 2 = 1 E1 ( v0 , vν + 4) 1 - 1 v 0 2 v ν 2 -4 1 - 1 v 0 2 v ν 2 , we can apply nonnegativity to obtain upper and lower bounds Ξ 1 ≤ E 1 E1 v0 , vν 1 2 v 0 2 2 v ν 2 2 -1 + 4C v 0 2 2 v ν 2 2 -1 2 ; Ξ 1 ≥ E 1 E1 v0 , vν 1 2 v 0 2 2 v ν 2 2 -1 -5C v 0 2 2 v ν 2 2 -1 2 , where C = 384. We continue with bounding the quadratic term arising in the previous equation. We have E 1 E1 v0 , vν v 0 2 2 v ν 2 2 -1 2 ≤ 4E v 0 2 2 v ν 2 2 -1 2 = 4E v 0 4 2 v ν 4 2 -2 v 0 2 2 v ν 2 2 + 1 ≤ 4 1 -2E[ v 0 2 2 v ν 2 2 ] + E v 0 8 2 ≤ 4 1 -2(1 -(Cn -1 + C e -cn )) 2 + 1 + C n ≤ Cn -1 e -cn + C e -c n + C n . The first inequality applies the triangle inequality for the integral, the definition of E 1 and Cauchy-Schwarz, then drops the indicator for E 1 because the remaining terms are nonnegative; the second line is just distributing; the third line rearranges and applies the Schwarz inequality; and the fourth inequality applies Jensen's inequality and Lemma E.18 to control the second term (to apply this lemma, we need to choose n larger than an absolute constant; we assume this is done), and Lemma E.29 to control the third term. Since n ≥ 1, this gives a C/n + C e -cn bound on the quadratic term. Next is the linear term; our first step will be to get rid of the indicator. 
By the triangle inequality, it suffices to get control of the corresponding term with the indicator for E c 1 instead; we control it as follows: E 1 E c 1 v0 , vν v 0 2 2 v ν 2 2 -1 ≤ E 1 E c 1 1/2 E v0 , vν 2 v 0 2 2 v ν 2 2 -1 2 1/2 ≤ Cne -cn + C e -c n 1/2 E v0 2 2 vν 2 2 v 0 2 2 v ν 2 2 -1 2 1/2 ≤ Cne -cn + C e -c n 1/2 E v0 2 2 vν 2 2 v 0 4 2 v ν 4 2 + v0 2 2 vν 2 2 1/2 ≤ Cne -cn + C e -c n 1/2 E v0 8 2 1/4 E v 0 16 2 1/4 + E v0 4 2 1/2 ≤ Cne -cn + C e -c n 1/2 1 + C 1 n 1/4 1 + C 2 n 1/4 + 1 + C 3 n 1/2 ≤ Cn 1/2 e -cn + C e -c n . The first line is the Schwarz inequality; the second line is the good event measure bound and Cauchy-Schwarz; the third line distributes and drops the cross term, given that all factors are nonnegative; the fourth line applies subadditivity of the square root function, then the Schwarz inequality to the resulting separate terms; the fifth line applies Lemma E.29; and the last line again uses square root subadditivity and treats the remaining terms as multiplicative constants, since n ≥ 1. Therefore passing to the linear term without the indicator incurs only an additional exponential factor. Proceeding, we drop the indicator and distribute to get for the linear term E v0 , vν v 0 2 2 v ν 2 2 -1 = E v0 , vν v 0 2 2 v ν 2 2 -E[ v0 , vν ] ; it is of interest to apply Lemma E.30 to these two terms to get the proper cancellation, and for this we just need to check that the coordinates of each factor in the product have subexponential moment growth with the proper rate. 
For even powers of 2 norms of v ν , this follows immediately from Lemma G.11 after scaling by 2/n; for the inner product term, the coordinate functions are σ(g 1i )g 2i σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν), and we have from the Schwarz inequality and rotational invariance E | σ(g 1i )g 2i σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν)| k ≤ E σ(g 1i )g 2k 2i , which has subexponential moment growth with rate Cn -1 by Lemma E.17 and Lemma G.11 after rescaling by 2/n. These formulas also show that when k = 1, we have a bound of precisely n -1 . This makes Lemma E.30 applicable, so we can assert bounds E v0 , vν v 0 2 2 v ν 2 2 -1 -n 3 E[( v0 ) 1 ( vν ) 1 ]E σ(g 11 ) 2 2 -nE[( v0 ) 1 ( vν ) 1 ] ≤ C n . Because E σ(g 11 ) 2 2 = n -2 , this is enough to conclude a C/n bound on the magnitude of the linear term. Thus, in total, we have shown |Ξ 1 | ≤ C n + C e -cn + C n 1/2 e -c n , where we combine the different constants that appear in the various exponential additive errors throughout our work by choosing the largest scaling factor and the smallest constant in the exponent. Ξ 2 control. The approach is similar to what we have used to control Ξ 1 . We start with exactly the same E 1 event definition, and as previously define Ξ 2 = E 1 E1 v0 , v ν v ν , vν v 0 2 2 v 0 3 2 v ν 3 2 , and then calculate | Ξ 2 -Ξ 2 | = E 1 Em\E v0 , v ν v ν , vν v 0 2 2 v 0 3 2 v ν 3 2 ≤ 2 6 E 1 Em\E | v0 , v ν v ν , vν | v 0 2 2 ≤ 2 6 E 1 E c | v0 , v ν v ν , vν | v 0 2 2 ≤ 2 6 E[1 E c ] 1/2 E v0 , v ν 4 1/4 E v ν , vν 8 1/8 E v 0 8 2 1/8 ≤ 2 6 E[1 E c ] 1/2 E v0 8 2 1/8 E v ν 8 2 1/8 E v ν 16 2 1/16 E vν 16 2 1/16 E v 0 8 2 1/8 ≤ Ce -cn + C n 1/2 e -c n , using the same ideas as in the previous section, plus several applications of the Schwarz inequality and a final application of Lemma E.29. We can therefore pass to Ξ 2 with a small additive error.
Next, we Taylor expand in the same way as previously, except that larger powers in the denominator force the constant in our residual bound to be 3 • 2 27 , and the event E 1 now gives us a bound | v0 , v ν v ν , vν v 0 2 2 | ≤ 2 6 on the numerator, which we add and subtract as before to exploit nonnegativity. We get Ξ 2 ≤ E 1 E1 v0 , v ν v ν , vν v 0 2 2 1 2 3 -v 0 6 2 v ν 6 2 + (2 6 + 1)C v 0 6 2 v ν 6 2 -1 2 ; Ξ 2 ≥ E 1 E1 v0 , v ν v ν , vν v 0 2 2 1 2 3 -v 0 6 2 v ν 6 2 -2 6 C v 0 6 2 v ν 6 2 -1 2 , with C = 3 • 2 27 . Proceeding to control the quadratic term, we have E 1 E1 v0 , v ν v ν , vν v 0 2 2 v 0 6 2 v ν 6 2 -1 2 ≤ 4 3 E v 0 6 2 v ν 6 2 -1 2 = 2 6 E v 0 12 2 v ν 12 2 -2 v 0 6 2 v ν 6 2 + 1 ≤ 2 6 1 -2E[ v 0 6 2 v ν 6 2 ] + E v 0 24 2 ≤ 2 6 1 -2(1 -(Cn -1 + C e -cn )) 6 + (1 + C n -1 ) ≤ Cn -1 + 3 k=1 6 2k -1 C n -1 + C e -cn 2k-1 ≤ Cn -1 + C 3 k=1 2k-1 j=0 n -(2k-1-j) e -cnj ≤ Cn -1 + C e -cn . The justifications for the first four lines are identical to those of the previous section. In the last three lines, we use the binomial theorem twice to expand the sixth power term, and we assert the final line by the fact that k > 0, so that each term in the sum corresponding to a j = 0 has a positive inverse power of n attached, and when j = 2k -1 we pick up an exponential factor. Moving on to the linear term, as in the previous section we start by dropping the indicator. We control the residual as follows: E 1 E c 1 v0 , v ν v ν , vν v 0 2 2 v 0 6 2 v ν 6 2 -3 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 2 2 v ν 4 2 v 0 6 2 v ν 6 2 -3 2 1/2 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 14 2 v ν 16 2 + 3 v0 2 2 vν 2 2 v 0 2 2 v ν 4 2 1/2 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 14 2 v ν 16 2 1/2 + 3E v0 2 2 vν 2 2 v 0 2 2 v ν 4 2 1/2 ≤ Ce -cn + C n 1/2 e -c n . 
The justifications are almost the same as the previous section, although we have compressed some steps into fewer lines here, and we have omitted the final simplifications, which follow from applying the Schwarz inequality to each of the two expectations in the second-to-last line three times and then applying Lemma E.29. Dropping the indicator and distributing now gives: E v0 , v ν v ν , vν v 0 2 2 v 0 6 2 v ν 6 2 -3 = E v0 , v ν v ν , vν v 0 2 2 v 0 6 2 v ν 6 2 -3E v0 , v ν v ν , vν v 0 2 2 ; to apply Lemma E.30, we check the two new coordinate functions that appear in this linear term: for v0 , v ν , we have E | σ(g 1i )g 2i σ(g 1i cos ν + g 2i sin ν)| k ≤ E σ(g 1i )g 2k 2i 1/2 E σ(g 1i ) 2k 1/2 , (E.34) and for vν , v ν , we have likewise E |σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν)| k ≤ E σ(g 1i )g 2k 2i 1/2 E σ(g 1i ) 2k 1/2 , (E.35) both by the Schwarz inequality and rotational invariance. As before, an appeal to Lemmas G.11 and E.17 implies that these two coordinate functions satisfy the hypotheses of Lemma E.30, so we have a bound E v0 , v ν v ν , vν v 0 2 2 v 0 6 2 v ν 6 2 -3 -n 9 E[( v0 ) 1 (v ν ) 1 ]E[( vν ) 1 (v ν ) 1 ]E σ(g 11 ) 2 7 + 3n 3 E[( v0 ) 1 ( vν ) 1 ]E[( vν ) 1 (v ν ) 1 ]E σ(g 11 ) 2 ≤ C n . Noticing that E[ v ν , vν ] = -E[ v 0 , v0 ] = -E[ σ(g 1 ), g 2 ] = 0, by rotational invariance and independence, we conclude, because the coordinates of v ν and vν are identically distributed, that n 9 E[( v0 ) 1 (v ν ) 1 ]E[( vν ) 1 (v ν ) 1 ]E σ(g 11 ) 2 7 -3n 3 E[( v0 ) 1 ( vν ) 1 ]E[( vν ) 1 (v ν ) 1 ]E σ(g 11 ) 2 = 0, which establishes the desired control on Ξ 2 . Thus, in total, we have shown |Ξ 2 | ≤ C n + C e -cn + C n 1/2 e -c n , where we combine the different constants that appear in the various exponential additive errors throughout our work by choosing the largest scaling factor and the smallest constant in the exponent. Ξ 3 control.
The argument for control of this term is very similar to the previous section, since the degrees of the denominators now match. We start by defining Ξ 3 = E 1 E1 v0 , v 0 v 0 , v ν v ν , vν v 0 3 2 v ν 3 2 , with the same E 1 event as previously, and then calculating | Ξ 3 -Ξ 3 | = E 1 Em\E v0 , v 0 v ν , vν v 0 , v ν v 0 3 2 v ν 3 2 ≤ 2 6 E 1 Em\E | v0 , v 0 v ν , vν v 0 , v ν | ≤ 2 6 E[1 E c | v0 , v 0 v ν , vν v 0 , v ν |] ≤ 2 6 E[1 E c ] 1/2 E v0 , v 0 4 1/4 E v ν , vν 8 1/8 E v 0 , v ν 8 1/8 ≤ 2 6 E[1 E c ] 1/2 E v0 8 2 1/8 E v 0 8 2 1/8 E vν 16 2 1/16 E v 0 16 2 1/16 E v ν 16 2 1/8 ≤ Cn 1/2 e -cn + Ce -c n , using the same ideas as in the previous section. We can therefore pass to Ξ 3 with an exponentially small error. Next, we Taylor expand in the same way as previously, obtaining Ξ 3 ≤ E 1 E1 v 0 , v ν v0 , v 0 v ν , vν 1 2 3 -v 0 6 2 v ν 6 2 + (4 3 + 1)C v 0 6 2 v ν 6 2 -1 2 ; Ξ 3 ≥ E 1 E1 v 0 , v ν v0 , v 0 v ν , vν 1 2 3 -v 0 6 2 v ν 6 2 -4 3 C v 0 6 2 v ν 6 2 -1 2 , with C = 3 • 2 27 . Proceeding to control the quadratic term, we notice E 1 E1 v0 , v 0 v ν , vν v 0 , v ν v 0 6 2 v ν 6 2 -1 2 ≤ 4 3 E v 0 6 2 v ν 6 2 -1 2 = Cn -1 + C e -cn , since the final term was controlled in the previous section. Moving on to the linear term, as in the previous section we start by dropping the indicator. We control the residual as follows: E 1 E c 1 v0 , v 0 v ν , vν v 0 , v ν v 0 6 2 v ν 6 2 -3 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 4 2 v ν 4 2 v 0 6 2 v ν 6 2 -3 2 1/2 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 16 2 v ν 16 2 + 3 v0 2 2 vν 2 2 v 0 4 2 v ν 4 2 1/2 ≤ E 1 E c 1 1/2 E v0 2 2 vν 2 2 v 0 16 2 v ν 16 2 1/2 + 3E v0 2 2 vν 2 2 v 0 4 2 v ν 4 2 1/2 ≤ Ce -cn + C n 1/2 e -c n , by the same argument as in the previous section. 
Dropping the indicator and distributing now gives: E v0 , v 0 v ν , vν v 0 , v ν v 0 6 2 v ν 6 2 -3 = E v0 , v 0 v ν , vν v 0 , v ν v 0 6 2 v ν 6 2 -3E[ v0 , v 0 v ν , vν v 0 , v ν ]; to apply Lemma E.30, we check the one new coordinate function that appears in this linear term: for v 0 , v ν , we have E |σ(g 1i )σ(g 1i cos ν + g 2i sin ν)| k ≤ E σ(g 1i ) 2k , (E.36) by the Schwarz inequality and rotational invariance. As before, an appeal to Lemmas G.11 and E.17 implies that this coordinate function satisfies the hypotheses of Lemma E.30, so we have a bound E v0 , v 0 v ν , vν v 0 , v ν v 0 6 2 v ν 6 2 -3 -n 9 E[( v0 ) 1 (v 0 ) 1 ]E[( vν ) 1 (v ν ) 1 ]E[(v ν ) 1 (v 0 ) 1 ]E σ(g 11 ) 2 6 + 3n 3 E[( v0 ) 1 (v 0 ) 1 ]E[( vν ) 1 (v ν ) 1 ]E[(v ν ) 1 (v 0 ) 1 ] n -1 ≤ C n . As in the previous section, using that E[ v ν , vν ] = 0 then allows us to conclude the desired control on Ξ 3 . Thus, in total, we have shown |Ξ 3 | ≤ C n + C e -cn + C n 1/2 e -c n , where we combine the different constants that appear in the various exponential additive errors throughout our work by choosing the largest scaling factor and the smallest constant in the exponent. To wrap up, we take the largest of the scaling constants in the estimates we have derived, and the smallest of the constants in the exponent, in order to assert |h′′(ν)| ≤ C n + C n 1/2 e -cn . Matching constants in the exponent and choosing n larger than an absolute constant multiple of log n (so that the n 1/2 factor is absorbed into the exponential), it follows |h(ν)| ≤ Ce -cn + C ν 2 /n, which was to be proved.

E.3.4 GENERAL PROPERTIES

Lemma E.16. Consider the event E c,m = ⋂ S⊂[n],|S|=m ⋂ ν∈[0,2π] { (g 1 , g 2 ) : c ≤ I S c v ν (g 1 , g 2 ) 2 ≤ c -1 }. Suppose n ≥ max{2m, m + 20}. Then we have the following properties: 1. µ(E c c,m ) ≤ Cn m e -c n ; 2. We have E c,m = E c,m Q for every Q ∈ O(2), so that in particular 1 Ec,m (GQ) = 1 Ec,m (G). Above, O(n) denotes the set of n × n orthogonal matrices. Proof. We will show the second property first. For each c > 0, if Q ∈ O(2), notice that E c,m Q = ⋂ S⊂[n],|S|=m ⋂ ν∈[0,2π] { GQ : c < I S c σ(G [cos ν, sin ν] * ) 2 < c -1 } = ⋂ S⊂[n],|S|=m ⋂ u∈S 1 { G : c < I S c σ(GQ * u) 2 < c -1 } = E c,m , since the vector [cos ν, sin ν] * ∈ S 1 , and O(2) acts transitively on S 1 . This proves the second property when c > 0; the result for c = 0 is obtained by applying the preceding argument to each set in the infinite union defining the c = 0, m event. For the measure bound, we observe that E c,m ⊂ E c′,m if c ≥ c′, so it suffices to bound the measure of the complement for the particular choice c = 1/2. We start by controlling pointwise the measure of the complement of the event E 0.6,m,u = ⋂ S⊂[n],|S|=m {G | 0.6 < I S c σ(Gu) 2 < 5/3} for each u ∈ S 1 , then uniformize over the one-dimensional manifold S 1 ; we need to begin with c = 0.6 instead of c = 1/2 to survive some loosening of the bounds when we uniformize. We have E c 0.6,m,u = ⋃ S⊂[n],|S|=m {G | I S c σ(Gu) 2 ≤ 0.6} ∪ {G | I S c σ(Gu) 2 ≥ 5/3}, so that a union bound implies µ(E c 0.6,m,u ) ≤ Σ S⊂[n],|S|=m P[ I S c σ(Gu) 2 ≤ 0.6] + P[ I S c σ(Gu) 2 ≥ 5/3] ≤ n m P I [m] c σ(g 1 ) 2 ≤ 0.6 + P I [m] c σ(g 1 ) 2 ≥ 5/3 , (E.37) where the final inequality follows from right-rotational invariance of µ and the identical distribution of the coordinates of g 1 .
Let g ∈ R n-m be distributed as N (0, (2/n)I), so that σ(g) has the same distribution as I [m] c σ(g 1 ). By Gauss-Lipschitz concentration (Boucheron et al., 2013, Theorem 5.6), we have P[ σ(g) 2 ≥ E[ σ(g) 2 ] + t] ≤ e -cnt 2 , P[ σ(g) 2 ≤ E[ σ(g) 2 ] -t] ≤ e -cnt 2 , since σ is 1-Lipschitz and nonnegative homogeneous. After rescaling, we apply Lemma E.19 to get √(1 - m/n) - 2/(√n √(n - m)) ≤ E[ σ(g) 2 ] ≤ √(1 - m/n) ≤ 1. Plugging these estimates into the Gauss-Lipschitz bounds gives P[ σ(g) 2 ≥ 1 + t] ≤ e -cnt 2 , P[ σ(g) 2 ≤ √(1 - m/n) - 2/(√n √(n - m)) - t] ≤ e -cnt 2 . Putting t = 2/3 in the upper tail bound gives the control we need for one half of (E.37). For the lower tail, we note that the assumption n ≥ max{2m, m + 20} yields the estimates √(1 - m/n) ≥ 1/√2, 2/(√n √(n - m)) ≤ 2/(n - m) ≤ 1/10, so that √(1 - m/n) - 2/(√n √(n - m)) - t ≥ 1/√2 - 1/10 - t, and one checks numerically that 2 -1/2 -(1/10) > 0.6. Putting therefore t = 2 -1/2 -(1/10) -0.6 in the lower tail bound yields P[ σ(g) 2 ≤ 0.6] ≤ e -cn . Plugging these results into (E.37) gives the pointwise measure bound µ(E c 0.6,m,u ) ≤ 2 n m e -cn for some constant c > 0. For uniformization, fix S ⊂ [n] with |S| = m and consider the function f S : R 2 → R defined by f S (u) = I S c σ(Gu) 2 . By Gauss-Lipschitz concentration, we have P[ G > E[ G ] + t] ≤ e -cnt 2 , and by (Rudelson & Vershynin, 2011, Theorem 2.6), we have E[ G ] ≤ √2 + 2/√n ≤ 4. Let E = { G ≤ 5}; then it follows that µ(E) ≥ 1 -e -cn . On E, for every S, we have that f S is a 5-Lipschitz function of u. For each ε > 0, let T ε ⊂ S 1 be a set with the property that for every u ∈ S 1 there is u′ ∈ T ε such that u -u′ 2 ≤ ε; by standard results (Vershynin, 2018, Corollary 4.2.13), T ε exists and we have |T ε | ≤ (1 + 2ε -1 ) 2 . Define E 0.6,m,ε = ⋂ u∈Tε E 0.6,m,u . Then a union bound together with our pointwise concentration result gives µ(E c 0.6,m,ε ) ≤ 2 n m (1 + 2/ε) 2 e -cn .
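The two tail bounds just derived are easy to visualize empirically. The following Monte Carlo sketch (illustrative only; the sample sizes and thresholds below are arbitrary choices, not from the text) draws g ~ N(0, (2/n)I) in R^(n-m) and checks that ‖σ(g)‖_2 concentrates well inside the interval (0.6, 5/3):

```python
import math
import random

random.seed(0)
n, m, trials = 200, 10, 2000
norms = []
for _ in range(trials):
    g = [random.gauss(0.0, math.sqrt(2.0 / n)) for _ in range(n - m)]
    norms.append(math.sqrt(sum(max(x, 0.0) ** 2 for x in g)))  # sigma = ReLU

mean = sum(norms) / trials
# E||sigma(g)||_2^2 = (n - m)/n = 0.95, so the mean sits just below sqrt(0.95)
assert 0.9 <= mean <= 1.0
# every draw stays within these (conservative) brackets inside (0.6, 5/3)
assert min(norms) > 0.5 and max(norms) < 1.4
```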
On E ∩ E 0.6,m,ε , for any u ∈ S 1 and any S, there is u′ ∈ T ε such that |f S (u) -f S (u′)| ≤ 5ε. But since on this event 0.6 ≤ f S (u′) ≤ 5/3, we conclude 0.6 -5ε ≤ f S (u) ≤ 5/3 + 5ε, and therefore the choice ε = 1/50 gives 0.5 ≤ f S (u) ≤ 2. This implies E ∩ E 0.6,m,1/50 ⊂ E 0.5,m .

We calculate |E[ v 0 2 v ν 2 ] -E[X]| ≤ µ(E c ) + E[1 E c ] 1/2 E[ v 0 4 2 ] 1/2 ≤ Ce -cn + C e -c n (1 + C /n) 1/2 using the triangle inequality, the Schwarz inequality, rotational invariance, and Lemmas E.16 and E.29. It follows E[ v 0 2 v ν 2 ] ≥ E[X] -C e -cn , so it suffices to prove the lower bound for X instead. Factoring as X = ( v 0 2 1 E + 1 E c )( v ν 2 1 E + 1 E c ), we apply concavity of x → log x, Jensen's inequality, and convexity of x → e x to get E[X] ≥ exp ( E log( v 0 2 1 E + 1 E c ) + E log( v ν 2 1 E + 1 E c ) ) ≥ 1 + E log( v 0 2 1 E + 1 E c ) + E log( v ν 2 1 E + 1 E c ) = 1 + 2E log( v 0 2 1 E + 1 E c ), where the last equality is due to rotational invariance. Now write Y = v 0 2 1 E + 1 E c , so that by the definition of E we have Y ≥ 1/2. Taylor expansion with Lagrange remainder of the logarithm about E[Y ] ≥ 1/2 gives log(Y ) = log E[Y ] + (1/E[Y ])(Y -E[Y ]) - (1/(2ξ(Y ) 2 ))(Y -E[Y ]) 2 for some ξ(Y ) between E[Y ] and Y . Using Y ≥ 1/2 (so that ξ(Y ) ≥ 1/2 and 1/(2ξ(Y ) 2 ) ≤ 2) and taking expectations on both sides, we get E[log Y ] ≥ log E[Y ] -2Var[Y ]. Moreover, we have |E[Y ] -E[ v 0 2 ]| ≤ Ce -cn + E[1 E c v 0 2 ] ≤ Ce -cn + C e -c n , by the Schwarz inequality, and this extra exponential error can be rolled into the exponential error accrued via our use of X. In particular, we have 1 - 2/n -Ce -cn ≤ E[Y ] ≤ 1 + Ce -cn , by Lemma E.19. Since n ≥ 20, if we also enforce n ≥ C 1 := c -1 log(5C/2), we have 2/n + Ce -cn ≤ 1/2; it follows by concavity of x → log(1 -x) that we have a bound log(1 - 2/n -Ce -cn ) ≥ -2 log(2) (2/n + Ce -cn ), which has the form claimed.
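The concavity step at the end, namely that log(1 - x) lies above the chord joining (0, 0) and (1/2, -log 2), so that log(1 - x) ≥ -2 log(2) x for 0 ≤ x ≤ 1/2, can be confirmed on a grid (a quick numerical check, not part of the proof):

```python
import math

# chord bound from concavity: log(1 - x) >= -2*log(2)*x on [0, 1/2],
# with equality at the endpoints x = 0 and x = 1/2
for k in range(50_001):
    x = 0.5 * k / 50_000
    assert math.log(1.0 - x) >= -2.0 * math.log(2.0) * x - 1e-12
```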
It remains to upper bound Var[Y ]; using that Y 2 = v 0 2 2 1 E + 1 E c , we have Var[Y ] = E[Y 2 ] -E[Y ] 2 ≤ 1 + Ce -cn -(1 - 2/n -Ce -cn ) 2 = Ce -cn + 2(2/n + Ce -cn ) -(2/n + Ce -cn ) 2 ≤ 4/n + 3Ce -cn , which is sufficient to conclude.

Lemma E.19. One has 1 - 2/n ≤ E g1,g2 [ v ν 2 ] ≤ 1.

Proof. By rotational invariance, it is equivalent to characterize the expectation of σ(g 1 ) 2 . By the Schwarz inequality, we have E[ v 0 2 ] ≤ E[ v 0 2 2 ] 1/2 = 1, by Lemma G.11. For the lower bound, we apply the Gaussian Poincaré inequality (Boucheron et al., 2013, Theorem 3.20) and the 1-Lipschitz property of g → σ(g) 2 to get (n/2) E[( v 0 2 -E[ v 0 2 ]) 2 ] ≤ 1, so that after distributing and applying E[ v 0 2 2 ] = 1, we see that 1 - 2/n ≤ E[ v 0 2 ] 2 . Because n ≥ 2, it follows E[ v 0 2 ] ≥ √(1 - 2/n) ≥ 1 - 2/n, where the last bound holds because 1 - 2n -1 ≤ 1.

Lemma E.20. If 0 ≤ x, y ≤ 1, we have |cos -1 x -cos -1 y| ≤ 2√(|x -y|).

Proof. Let 0 ≤ x, y ≤ 1, and assume to begin that x ≤ y. We apply the fundamental theorem of calculus and knowledge of the derivative of cos -1 to get cos -1 x -cos -1 y = ∫ x y (1 -t 2 ) -1/2 dt. The integrand is nonnegative, so cos -1 x -cos -1 y ≥ 0. Writing √(1 -t 2 ) = √(1 -t) √(1 + t) and using x ≥ 0, we get cos -1 x -cos -1 y ≤ ∫ x y (1 -t) -1/2 dt = 2(√(1 -x) -√(1 -y)). This shows that |cos -1 x -cos -1 y| ≤ 2|√(1 -x) -√(1 -y)| when x ≤ y. An almost-identical argument establishes the same when y ≤ x, via the inequalities 0 ≥ cos -1 x -cos -1 y ≥ -2(√(1 -x) -√(1 -y)). So we have shown |cos -1 x -cos -1 y| ≤ 2|√(1 -x) -√(1 -y)| for arbitrary 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Now notice |√(1 -x) -√(1 -y)| 2 ≤ |√(1 -x) -√(1 -y)||√(1 -x) + √(1 -y)| = |(1 -x) -(1 -y)| = |x -y|, which establishes |cos -1 x -cos -1 y| ≤ 2√(|x -y|).

Fix x ∈ A, and consider the case f (x) ≠ c. Then because f is continuous, there is a neighborhood of x on which f ≠ c. If f > c on this neighborhood, then we have max{f, c} = f on this neighborhood; if f < c, then we have max{f, c} = c. In either case, this implies that max{f, c} is differentiable at x, and thus x is not in B. Next, consider the case where f (x) = c. First, suppose f ′(x) > 0; then we can find a neighborhood of x on which f (x′) > c if x′ > x and f (x′) < c if x′ < x. Possibly shrinking this neighborhood, we can assume every point of the neighborhood is a point of differentiability of f . Thus, for x′ > x in this neighborhood, we have max{f (x′), c} = f (x′), and for x′ < x, we have max{f (x′), c} = c. We conclude that max{f, c} is differentiable at all points of this neighborhood except possibly x, and in particular x is an isolated point in B. A symmetric argument treats the case where f ′(x) < 0, with the same conclusion. On the other hand, if f ′(x) = 0, we can write f (x′) = c + o(|x′ -x|) for x′ in a neighborhood of x, which implies max{f (x′), c} = max{c, c + o(|x′ -x|)} = c + o(|x′ -x|). In particular, |max{f (x′), c} -max{f (x), c}| = o(|x′ -x|), which shows that max{f, c} is differentiable at x, and thus x is not in B. This shows that every point of A ∩ B is isolated in A ∩ B, and we can therefore conclude that max{f, c} is differentiable except at isolated points of (a, b).

Lemma E.22. For 0 ≤ ν ≤ π, consider the function φ(ν) = E g1,g2∼ i.i.d. N (0,(2/n)I) [1 E1 φ(ν, g 1 , g 2 )], where φ(ν, g 1 , g 2 ) = cos -1 ( v 0 , v ν / ( v 0 2 v ν 2 )). Then φ is absolutely continuous on [0, π], and satisfies the first-order Taylor expansion φ(ν) = φ(0) - ∫ 0 ν E g1,g2 [ 1 E1 v0 / v0 2 , (1/ vt 2 )(I - vt v * t / vt 2 2 ) vt / √(1 - v0 / v0 2 , vt / vt 2 2 ) ] dt, and moreover φ is 1-Lipschitz. Proof. At points of (0, π) where each of the functions composed in φ is differentiable, the chain rule gives for the derivative of the integrand as a function of ν φ ′(ν, g 1 , g 2 ) = - v0 / v0 2 , (1/ vν 2 )(I - vν v * ν / vν 2 2 ) vν / √(1 - v0 / v0 2 , vν / vν 2 2 ), (E.38) where we have used the fact that the derivative of the map x → x/ x 2 is (1/ x 2 )(I -xx * / x 2 2 ), valid for any x ≠ 0.
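The Jacobian identity just invoked, namely that the derivative of x → x/‖x‖_2 is (1/‖x‖_2)(I - xx*/‖x‖_2^2) for x ≠ 0, can be sanity-checked against central finite differences at a random base point (an illustration; the dimension and step size are arbitrary):

```python
import math
import random

def normalize(x):
    nrm = math.sqrt(sum(t * t for t in x))
    return [t / nrm for t in x]

def jacobian(x):
    """(1/||x||)(I - x x* / ||x||^2), the claimed derivative of x -> x/||x||."""
    n2 = sum(t * t for t in x)
    nrm = math.sqrt(n2)
    d = len(x)
    return [[((i == j) - x[i] * x[j] / n2) / nrm for j in range(d)] for i in range(d)]

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(5)]
J, h = jacobian(x), 1e-6
for j in range(5):  # column j vs. a central difference in coordinate j
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    col = [(a - b) / (2 * h) for a, b in zip(normalize(xp), normalize(xm))]
    for i in range(5):
        assert abs(col[i] - J[i][j]) < 1e-6
```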
Because E 1 guarantees that v ν ≠ 0 for all ν ∈ [0, π], we see that the integrand φ is continuous. Similarly, given that v ν 2 ≥ 1/2 on E 1 , we note that there are just two obstructions to differentiability: 1. The inverse cosine is not differentiable at {±1}; 2. The activation σ is not differentiable at 0. First we characterize the issue of nondifferentiability with regards to the inverse cosine. We note that cos φ(ν, g 1 , g 2 ) = 1 if and only if the Cauchy-Schwarz inequality is tight, which is equivalent to v 0 and v ν being linearly dependent. Suppose we have (g 1 , g 2 ) ∈ E 1 and ν 0 ∈ (0, π) such that v 0 (g 1 , g 2 ) and v ν0 (g 1 , g 2 ) are linearly dependent. Because two vectors u 1 , u 2 ∈ R n have σ(u 1 ) and σ(u 2 ) linearly dependent if and only if σ(u 1 ) and σ(u 2 ) have the same support and are linearly dependent on the support, and given that v ν 0 > 1 for each ν (i.e., v ν has more than one nonzero coordinate), we have that there is a 2 × 2 submatrix of GM ν0 having positive entries and rank 1 (since the rank is zero if and only if the submatrix is zero), where M ν = [ 1, cos ν; 0, sin ν ]. Write the corresponding 2 × 2 submatrix of G as X. Because rank M ν0 = 2 by ν 0 ∈ (0, π), we have rank X = 1. On the other hand, if G ∼ i.i.d. N (0, 2/n), we have P[G has a singular 2 × 2 minor] ≤ Σ 1≤i<j≤n P[ rank [ G 1i , G 2i ; G 1j , G 2j ] < 2 ] = 0, where the inequality is a union bound, and the equality uses the fact that the 2 × 2 submatrices of G have i.i.d. N (0, 2/n) entries, and that the complement of the set of full-rank 2 × 2 matrices is a positive-codimensional closed embedded submanifold of R 2×2 . It follows that the subset of E 1 of matrices having no singular 2 × 2 minor has full measure in E 1 , and we conclude that for almost all (g 1 , g 2 ), we have cos φ(ν, g 1 , g 2 ) < 1 for every ν ∈ (0, π).
Next, we characterize nondifferentiability due to the activation σ; by the chain rule, it suffices to consider nondifferentiability of v ν as a function of ν, and then Lemma E.21 implies that for every (g 1 , g 2 ), v ν is differentiable at all but at most countably many points of [0, π]. Next, we observe that whenever v ν is nonvanishing, one has v 0 v 0 2 , 1 v ν 2 I - v ν v * ν v ν 2 2 vν ≤ P ⊥ vν vν 2 v ν 2 I - v ν v * ν v ν 2 2 v 0 v 0 2 2 = P ⊥ vν vν 2 v ν 2 √(1 - v0 v0 2 , vν vν 2 2 ), where the first inequality is due to squaring the orthogonal projection and Cauchy-Schwarz, and the second equality follows from distributing to evaluate the squared norm, cancelling, and taking square roots. Using the fact that orthogonal projections have operator norm 1, we thus conclude |φ ′(ν, g 1 , g 2 )| ≤ P ⊥ vν vν 2 / v ν 2 ≤ C vν 2 , (E.39) where the last inequality is valid whenever (g 1 , g 2 ) ∈ E 1 . Since vν 2 = σ ′(g 1 cos ν + g 2 sin ν) (g 2 cos ν -g 1 sin ν) 2 ≤ g 2 cos ν -g 1 sin ν 2 ≤ g 2 2 + g 1 2 , and this upper bound is jointly integrable in ν and (g 1 , g 2 ) over [0, π] × R n × R n , we can apply (Cohn, 2013, Theorem 6.3.11) to obtain that for every (g 1 , g 2 ) in E 1 minus a negligible set, we have for every ν ∈ [0, π] φ(ν, g 1 , g 2 ) = φ(0, g 1 , g 2 ) + ∫ 0 ν φ ′(t, g 1 , g 2 ) dt. In particular, multiplying by the indicator for E 1 , taking expectations over (g 1 , g 2 ), and applying the previous joint integrability assertion for φ ′ together with Fubini's theorem yields φ(ν) = φ(0) + ∫ 0 ν E g1,g2 [φ ′(t, g 1 , g 2 )] dt, so to conclude the Lipschitz estimate, it suffices to obtain a suitable estimate on E g1,g2 [φ ′(ν, g 1 , g 2 )].
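The computation carried out next bottoms out in an appeal to Lemma G.9; the mechanism behind the final bound (E.40) is that, for each support size k, the two conditional factors multiply to exactly one, E_{X∼χ(k-1)}[X] · E_{Y∼χ(k)}[1/Y] = 1, so the binomial weights 2^(-n) (n choose k) sum to at most 1. Using the standard chi-distribution moment formulas (stated here as known facts, not taken from the text), the identity checks out numerically:

```python
import math

def chi_mean(k):
    """E[X] for X ~ chi(k): sqrt(2) * Gamma((k+1)/2) / Gamma(k/2)."""
    return math.sqrt(2.0) * math.gamma((k + 1) / 2) / math.gamma(k / 2)

def chi_inv_mean(k):
    """E[1/Y] for Y ~ chi(k), finite for k >= 2:
    Gamma((k-1)/2) / (sqrt(2) * Gamma(k/2))."""
    return math.gamma((k - 1) / 2) / (math.sqrt(2.0) * math.gamma(k / 2))

# the product is exactly 1 for every support size k >= 2
for k in range(2, 40):
    assert abs(chi_mean(k - 1) * chi_inv_mean(k) - 1.0) < 1e-9
```

Note that the scaling of the Gaussian entries by √(2/n) cancels between the numerator and denominator chi variables, so the identity is scale-free.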
In light of (E.39) we calculate more precisely E 1 E1 P ⊥ vν vν 2 v ν 2 = E 1 E1 P ⊥ v0 v0 2 v 0 2 = E   1 E1 I -σ(g1)σ(g1) * σ(g1) 2 2 ( σ(g 1 ) g 2 ) 2 σ(g 1 ) 2   ≤ E   1 σ(g1) 0>1 I -σ(g1)σ(g1) * σ(g1) 2 2 ( σ(g 1 ) g 2 ) 2 σ(g 1 ) 2   = n k=2 2 -n n k E   I -σ(g1)σ(g1) * σ(g1) 2 2 ( σ(g 1 ) g 2 ) 2 σ(g 1 ) 2 σ(g 1 ) 0 = k   = n k=2 2 -n n k E X∼χ(k-1) [X] E Y ∼χ(k) 1 Y . In the first line, we apply rotational invariance and unpack notation; in the second line, we use nonnegativity of the integrand to pass to the containing event where v 0 is at least 2-sparse; and in the third line, we condition on the size of the support of g 1 . In the fourth line, we use several facts; first, we note that P ⊥ v0 ( σ(g 1 ) g 2 ) = P ⊥ v0 P {σ(g1)>0} g 2 for any g 2 ∈ R n , and that the commutation relation P ⊥ v0 P {σ(g1)>0} = P {σ(g1)>0} P ⊥ v0 implies that the operator P ⊥ v0 P {σ(g1)>0} is itself an orthogonal projection, with range equal to the ( v 0 0 -1)-dimensional subspace consisting of vectors with support supp(v 0 ) orthogonal to v 0 . In particular, σ(g 1 ) and P ⊥ v0 P {v0>0} g 2 are independent gaussian vectors, and conditioned on the size of the support of σ(g 1 ) the quantities σ(g 1 ) 2 and P ⊥ v0 P {v0>0} g 2 2 are distributed as independent chi random variables with (respectively) k and k -1 degrees of freedom. An application of Lemma G.9 then gives E 1 E1 P ⊥ vν vν 2 v ν 2 ≤ 1, (E.40) which is sufficient to conclude. Lemma E.23. The random variable X ν satisfies the following regularity properties: 1. If 0 < ν ≤ π, we have X ν < 1 almost surely. 2. If (g 1 , g 2 ) ∈ E 1 , then X ν is absolutely continuous on [0, π], with a.e. derivative Ẋν = v 0 v 0 2 , 1 v ν 2 I - v ν v * ν v ν 2 2 vν , and moreover we have Eg 1 ,g2 [| Ẋν |] ≤ 1, so the analogous differentiation result applies to Eg 1 ,g2 [X ν ]. Proof. 
The first claim is a corollary of the proof of differentiability of the inverse cosine part of φ in Lemma E.22 and the observation that X π = 0. The second claim is also a direct consequence of the proof of Lemma E.22 and Fubini's theorem. Lemma E.24. Consider the function f (ν) = E g1,g2 [X ν ] = E g1,g2 1 E1 v 0 v 0 2 , v ν v ν 2 . Then f is continuously differentiable, with derivative f (ν) = E g1,g2 1 E1 v 0 v 0 2 , 1 v ν 2 I - v ν v * ν v ν 2 2 vν . Moreover, f is absolutely continuous, with Lebesgue-a.e. derivative f (ν) = -E g1,g2 1 E1 1 v ν 2 I - v ν v * ν v ν 2 2 vν , 1 v 0 2 I - v 0 v * 0 v 0 2 2 v0 Proof. The expression for f is a direct consequence of Lemma E.23. To see that f is actually continuous, apply rotational invariance of the Gaussian measure and of 1 E1 by Lemma E.16 to get f (ν) = -E g1,g2 1 E1 v ν v ν 2 , 1 v 0 2 I - v 0 v * 0 v 0 2 2 v0 , then notice that this expression is an integral of a continuous function of ν, which is therefore continuous. Moreover, the ν dependence in this expression for f mirrors exactly that of f ; in particular, the integrand - v ν v ν 2 , 1 v 0 2 I - v 0 v * 0 v 0 2 2 v0 is absolutely continuous whenever (g 1 , g 2 ) ∈ E 1 by Lemma E.23, with a.e. derivative - 1 v ν 2 I - v ν v * ν v ν 2 2 vν , 1 v 0 2 I - v 0 v * 0 v 0 2 2 v0 . We can therefore conclude the claimed expression for f provided we can show absolute integrability over E 1 of this last expression, using Fubini's theorem in a way analogous to the argument in Lemma E.22. But E g1,g2 1 E1 vν v ν 2 I - v ν v * ν v ν 2 2 , v0 v 0 2 I - v 0 v * 0 v 0 2 2 ≤ 4 E g1,g2 1 E1 P ⊥ vν vν 2 P ⊥ v0 v0 2 ≤ 4E v0 2 2 = 4, using, in sequence, Cauchy-Schwarz and the lower bound in the definition of E 1 ; the operator norm of orthogonal projections being 1, the Schwarz inequality, nonnegativity of the integrand, and rotational invariance; and Lemma E.17. We can therefore conclude the claimed expression for f and complete the proof. Lemma E.25. 
For the heuristic cosine angle evolution function cos ϕ(ν) = E g1,g2 [ v 0 , v ν ], we have the following integral representations for its continuous derivatives: (cos •ϕ) (ν) = E g1,g2 [ v 0 , vν ] (cos •ϕ) (ν) = -E g1,g2 [ v0 , vν ]. Proof. The proof follows exactly the arguments of Lemma E.24, but with a simpler integrand and different integrability checks; the continuity assertion relies on Lemma E.5. Indeed, this approach gives that v 0 , v ν is absolutely continuous, with Lebesgue-a.e. derivative v 0 , vν ; we check E g1,g2 [| v 0 , vν |] ≤ E g1,g2 v 0 2 2 1/2 E g1,g2 v0 2 2 1/2 ≤ 1 by Cauchy-Schwarz, the Schwarz inequality, rotational invariance, and Lemma E.17. This verifies the claimed expression for (cos •ϕ) . For the second derivative, we apply rotational invariance to get (cos •ϕ) (ν) = -E g1,g2 [ v ν , v0 ], which has an absolutely continuous integrand, with Lebesgue-a.e. derivative -v0 , vν . Checking absolute integrability, we have as before E g1,g2 [| v0 , vν |] ≤ E g1,g2 v0 2 2 ≤ 1 by Cauchy-Schwarz, the Schwarz inequality, rotational invariance, and Lemma E.17. This establishes the claimed expression for (cos •ϕ) . Lemma E.26. Let ψ : R → R be defined by ψ(x) = ψ 0.25 (x), where ψ 0.25 is the function constructed in Lemma E.31. Then the function f (ν, g 1 ) = E g2 v 0 , v ν ψ( v 0 2 )ψ( v ν 2 ) satisfies for all ν ∈ [0, π] and Lebesgue-a.e. g 1 the second-order Taylor expansion f (ν, g 1 ) = v 0 2 2 ψ( v 0 2 ) 2 + ν 0 t 0 E g2 n i=1 σ(g 1i ) 3 ρ(-g 1i cot s) ψ( v 0 2 )ψ( v i s 2 ) sin 3 s -E g2 v 0 , v s ψ( v 0 2 )ψ( v s 2 ) - v 0 , v s ψ ( v s 2 ) v s 2 ψ( v 0 2 )ψ( v s 2 ) 2 -E g2 + v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 2 + E g2 -2 v 0 , vs v s , vs ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 - v 0 , v s vs 2 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 + E g2 2 v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 3 v s 2 2 + v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 3 2 ds. 
where previously-unspecified notation in this expression is introduced in (E.44). Proof. Take g 1 ∈ R n such that f (ν, • ) exists and is g 1 -integrable; by Fubini's theorem such g 1 have full measure in R n . Because ψ > 0 and ψ( v ν ) is locally (as a function of ν) constant whenever v ν < 1/4, we need only consider nondifferentiability of σ when assessing differentiability of f ( • , g 1 ). By Lemma E.21, we conclude that f ( • , g 1 ) is differentiable at all but at most countably many points of (0, π); since ψ > 0 and ψ is smooth, f is continuous, and we can therefore apply Lebesgue differentiation theorems (Cohn, 2013, Theorem 6.3.11) to f provided we satisfy the standard derivative product integrability checks. Writing φ(ν, g 1 , g 2 ) = v 0 , v ν / (ψ( v 0 2 )ψ( v ν 2 )), the chain rule gives (at points of differentiability) φ ′(ν, g 1 , g 2 ) = v 0 / (ψ( v 0 2 )ψ( v ν 2 )) , vν - v 0 , v ν ψ ′( v ν 2 )v ν / (ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 ) , vν . In this expression, we follow the convention 0/0 = 0 to account for the possibility that v ν 2 = 0 (in this case, the ψ term handles the denominator). For product integrability, we apply Lemma E.31 to get |ψ ′| ≤ C for some absolute constant C > 0, together with Cauchy-Schwarz and the triangle inequality, to get |φ ′(ν, g 1 , g 2 )| ≤ 16 v 0 2 vν 2 + 64C v 0 2 v ν 2 vν 2 , and applying the Schwarz inequality, rotational invariance (to eliminate ν dependence in the resulting expectations) and Lemma E.17, we conclude that φ ′ is jointly absolutely integrable over [0, π] × (R n×2 , µ ⊗ µ). We have therefore a first-order Taylor expansion f (ν, g 1 ) = f (0, g 1 ) + ∫ 0 ν ( Ξ 1 (t) -Ξ 2 (t) ) dt, where Ξ 1 (ν) := E g2 v 0 / (ψ( v 0 2 )ψ( v ν 2 )) , vν and Ξ 2 (ν) := E g2 v 0 , v ν ψ ′( v ν 2 )v ν / (ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 ) , vν . We have f (0, g 1 ) = E g2 [ v 0 2 2 / ψ( v 0 2 ) 2 ] = v 0 2 2 / ψ( v 0 2 ) 2 , since v 0 depends only on g 1 . Next, we show t-differentiability of the inner expectation. Our aim is to apply Lemma E.27 to differentiate Ξ 1 and Ξ 2 .
We first focus on Ξ 1 ; distributing and applying linearity, we have Ξ 1 (ν) = n i=1 E g2 σ(g 1i )(g 2i cos ν -g 1i sin ν) ψ( v 0 2 )ψ( v ν 2 ) σ(g 1i cos ν + g 2i sin ν) . We have shown absolute integrability of the quantity inside the expectation above; we can therefore apply Fubini's theorem and the previous definition to write Ξ 1 (ν) = n i=1 E (g2j ):j =i E g2i σ(g 1i )(g 2i cos ν -g 1i sin ν) ψ( v 0 (g 1 , g 2 ) 2 )ψ( v ν (g 1 , g 2 ) 2 ) σ(g 1i cos ν + g 2i sin ν) . (E.41) For each i ∈ [n], write π i : R n → R n-1 for the linear map that deletes the i-th coordinate from its input, and let πi : R × R n-1 → R n be the linear map such that πi (g i , π i (g)) = g. With g 2 fixed (in the context of (E.41)), if we define f 1 (ν, g) = σ(g 1i )(g cos ν -g 1i sin ν) ψ( v 0 (g 1 , πi (g, π i (g 2 ))) 2 )ψ( v ν (g 1 , πi (g, π i (g 2 ))) 2 ) , then we can write Ξ 1 (ν) = n i=1 E (g2j ):j =i E g2i [f 1 (ν, g 2i ) σ(g 1i cos ν + g 2i sin ν)] . Thus, to differentiate Ξ 1 , it suffices to check the regularity of f 1 (ν, g) and apply Lemma E.27. As before, ψ > 0 and ψ smooth implies that f 1 is continuous on [0, π] × R. For integrability of f , we appeal to the Fubini's theorem justification that we applied previously. For absolute continuity, we apply Lemma E.21 to get that the derivative of f with respect to ν is, by the chain rule, f 1 (ν, g) = -σ(g 1i ) g 1i cos ν + g sin ν ψ( v 0 2 )ψ( v ν 2 ) + (g cos ν -g 1i sin ν)ψ ( v ν 2 ) v ν , vν ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 at all but at most countably many values of ν; and the triangle inequality, Cauchy-Schwarz, and Lemma E.31 yield |f 1 (ν, g)| ≤ σ(g 1i ) (16(|g 1i | + |g|) + 64C(|g| + |g 1i |) vν 2 ) ≤ σ(g 1i )(|g| + |g 1i |) (16 + 64C( g 1 2 + g 2 2 )) ≤ σ(g 1i )(|g| + |g 1i |) (16 + 64C( g 1 2 + π i (g 2 ) 2 + |g|)) , (E.42) (we apply square root subadditivity in the last line) which is jointly integrable over [0, π] × R, and moreover over [0, π] × R n . 
We conclude absolute continuity of f 1 ( • , g) and the integrability property of f 1 . Finally, for the growth estimate, we obtain an estimate for f 1 similar to the one we just obtained for f 1 as follows: |f 1 (ν, g)| ≤ 16|g 1i |(|g| + |g 1i |); (E.43) the RHS of the final inequality above is a linear function of |g|, and when |g| ≥ 1 we can therefore obtain |f 1 (ν, g)| ≤ 16(|g 1i | + |g 1i | 2 )|g|, which is a suitable growth estimate with p = 1. Then as long as g 1i = 0 for all i (such g 1 form a set of measure zero, which we can neglect), we can apply Lemma E.27 to get Ξ 1 (ν) = n i=1 E (g2j ):j =i E g2i [f 1 (0, g 2i ) σ(g 1i )] + ν 0 Eg 2i [f 1 (t, g 2i ) σ(g 1i cos t + g 2i sin t)] -g 1i f1(t,-g1i cot t)ρ(-g1i cot t) sin 2 t dt . The estimates (E.42) and (E.43) show, respectively, that f 1 and f 1 are absolutely integrable functions of (ν, g 2 ). We have f 1 (t, -g 1i cot t) = - σ(g 1i ) 2 ψ( v 0 (g 1 , g 2 ) 2 )ψ( v t (g 1 , πi (-g 1i cot t, π i (g 2 ))) 2 ) sin t , so that Lemma E.31 and nonnegativity give g 1i f 1 (t, -g 1i cot t) ρ(-g 1i cot t) sin 2 t ≤ 16 σ(g 1i ) 3 sin 3 t ρ(-g 1i cot t). As in the proof of Lemma E.37, in particular using the estimates (E.52) (E.53) to control the magnitude of the RHS for all values of t, we can conclude that the Dirac term is absolutely integrable over [0, π] × R n . An application of Fubini's theorem then allows us to re-combine the split integrals in the previous expression: Ξ 1 (ν) = n i=1 E g2 [f 1 (0, g 2i ) σ(g 1i )] + ν 0 Eg 2 [f 1 (t, g 2i ) σ(g 1i cos t + g 2i sin t)] -g 1i ρ(-g1i cot t) sin 2 t Eg 2 [f 1 (t, -g 1i cot t)] dt. We notice that v t (g 1 , πi (-g 1i cot t, π i (g 2 ))) =            σ(g 11 cos ν + g 21 sin ν) . . . σ(g 1(i-1) cos ν + g 2(i-1) sin ν) 0 σ(g 1(i+1) cos ν + g 2(i+1) sin ν) . . . 
σ(g 1n cos ν + g 2n sin ν)            , and thus motivated introduce the notation gi (t, g 1 , g 2 ) = πi (-g 1i cot t, π i (g 2 )); v i t (g 1 , g 2 ) = v t (g 1 , gi (t, g 1 , g 2 )) . (E.44) We can then write -g 1i f 1 (t, -g 1i cot t) ρ(-g 1i cot t) sin 2 t = σ(g 1i ) 3 ρ(-g 1i cot t) ψ( v 0 2 )ψ( v i t 2 ) sin 3 t . Finally, we apply linearity of the integral to move the summation over i back inside the integrals, obtaining Ξ 1 (ν) = E g2 v 0 , v0 ψ( v 0 2 ) 2 + ν 0 E g2 n i=1 σ(g 1i ) 3 ρ(-g 1i cot t) ψ( v 0 2 )ψ( v i t 2 ) sin 3 t -E g2 v0,vt ψ( v0 2 )ψ( vt 2) + v0, vt vt, vt ψ ( vt 2 ) ψ( v0 2)ψ( vt 2) 2 vt 2 dt. Noting that, in the zero-order term, the only g 2 dependence is in v0 = σ(g 1 ) g 2 , we apply independence of g 1 and g 2 to obtain finally Ξ 1 (ν) = ν 0 E g2 n i=1 σ(g 1i ) 3 ρ(-g 1i cot t) ψ( v 0 2 )ψ( v i t 2 ) sin 3 t dt - ν 0 E g2 v 0 , v t ψ( v 0 2 )ψ( v t 2 ) + v 0 , vt v t , vt ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 2 v t 2 dt We run the same type of argument on Ξ 2 next. Distributing and applying linearity, we have Ξ 2 (ν) = n i=1 E g2 [I m σ(g 1i cos ν + g 2i sin ν)A], where in the previous expression A = v 0 , v ν σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν)ψ ( v ν 2 ) ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 . By the preceding (product) absolute integrability check when taking first derivatives, we can apply Fubini's theorem to split the integral as we did with Ξ 1 . We define, with g 2 fixed, the function f 2 (ν, g) = B v 0 (g 1 ), v ν (g 1 , πi (g, π i (g 2 ))) ψ( v 0 (g 1 ) 2 )ψ( v ν (g 1 , πi (g, π i (g 2 ))) 2 ) 2 v ν (g 1 , πi (g, π i (g 2 ))) 2 where in the previous expression B = σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν)ψ ( v ν (g 1 , πi (g, π i (g 2 ))) 2 ), so that Ξ 2 (ν) = n i=1 E (g2j ):j =i E g2i [f 2 (ν, g 2i ) σ(g 1i cos ν + g 2i sin ν)] . Now we check that the hypotheses of Lemma E.27 are satisfied for f 2 . 
The continuity argument is identical to that employed for f 1 , as is the joint absolute integrability property of f 2 . For absolute continuity, we again use ψ > 0, ψ smooth, and Lemma E.21 to obtain the derivative at all but finitely many points of [0, π] (by the chain rule and the Leibniz rule) as f 2 (ν, g) = v 0 , vν σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν)ψ ( v ν 2 ) ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 + v 0 , v ν σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν) 2 ψ ( v ν 2 ) ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 - v 0 , v ν σ(g 1i cos ν + g sin ν)(g 1i cos ν + g sin ν)ψ ( v ν 2 ) ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 -2 v 0 , v ν σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν)ψ ( v ν 2 ) v ν , vν ψ( v 0 2 )ψ( v ν 2 ) 3 v ν 2 2 - v 0 , v ν σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν)ψ ( v ν 2 ) v ν , vν ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 3 2 + v 0 , v ν σ(g 1i cos ν + g sin ν)(g cos ν -g 1i sin ν)ψ ( v ν 2 ) v ν , vν ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 2 . Because ψ or ψ and our convention handle cancellation in the case where v ν 2 = 0, we can proceed when necessary with the convenient estimate σ(g 1i cos ν + g sin ν) v ν (g 1 , πi (g, π i (g 2 ))) 2 ≤ 1, which follows from the fact that u ∞ ≤ u 2 for any u ∈ R n . As with Ξ 1 , we then estimate the magnitude of f 2 using Lemma E.31, Cauchy-Schwarz, the triangle inequality, and square-root subadditivity (skipping some steps that we wrote out in the Ξ 1 estimate): |f 2 (ν, g)| ≤ 64C(|g| + |g 1i |) v 0 2 vν 2 2 + C C + (|g| + |g 1i |) (2 + 8 vν 2 ) ≤ 64C(|g| + |g 1i |) g 1 2 ( g 1 2 + π i (g 2 ) 2 + |g|) 2 + C C +(|g| + |g 1i |) (2 + 8( g 1 2 + π i (g 2 ) 2 + |g|)) , (E.45) which is jointly integrable over [0, π]×R, and moreover over [0, π]×R n . We conclude absolute continuity of f 2 ( • , g) and the integrability property of f 2 . 
For the growth estimate, we argue similarly to our bound on f 2 to get |f 2 (ν, g)| ≤ 64C v 0 2 (|g| + |g 1i |) 2 ≤ 64C g 1 2 |g| 2 + 2|g 1i ||g| + |g 1i | 2 ; (E.46) the RHS in the final inequality is a quadratic function of |g|, and we therefore obtain a suitable growth estimate with p = 2 and C = 64C g 1 (1 + 2|g 1i | + |g 1i | 2 ) as soon as |g| ≥ 1. We can therefore apply Lemma E.27 to get that for all but a negligible set of g 1 that Ξ 2 (ν) = n i=1 E (g2j ):j =i E g2i [f 2 (0, g 2i ) σ(g 1i )] + ν 0 Eg 2i [f 2 (t, g 2i ) σ(g 1i cos t + g 2i sin t)] -g 1i f2(t,-g1i cot t)ρ(-g1i cot t) sin 2 t dt . The estimates (E.45) and (E.46) show, respectively, that f 2 and f 2 are absolutely integrable functions of (ν, g 2 ). Because σ(g 1i cos ν -g 1i cot ν sin ν) = 0, we have (fortuitously) f 2 (t, -g 1i cot t) = 0, so that there is no Dirac term in the derivative expression for Ξ 2 . An application of Fubini's theorem then allows us to re-combine the split integrals in the previous expression: Ξ 2 (ν) = n i=1 E g2 [f 2 (0, g 2i ) σ(g 1i )] + ν 0 E g2 [f 2 (t, g 2i ) σ(g 1i cos t + g 2i sin t)] dt. We have by linearity of the integral n i=1 E g2 [f 2 (0, g 2i ) σ(g 1i )] = E g2 v 0 , v0 v 0 ψ ( v 0 ) ψ( v 0 ) 3 = 0, where the last equality applied independence of g 1 and g 2 , as in the zero-order term of Ξ 1 . Finally, we apply linearity of the integral to move the summation over i back inside the remaining integrals, obtaining Ξ 2 (ν) = ν 0 E g2 v 0 , vt v t , vt ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 2 v t 2 + v 0 , v t vt 2 2 ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 2 v t 2 + E g2 - v 0 , v t ψ ( v t 2 ) v t 2 ψ( v 0 2 )ψ( v t 2 ) 2 -2 v 0 , v t v t , vt 2 ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 3 v t 2 2 + E g2 - v 0 , v t v t , vt 2 ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 2 v t 3 2 + v 0 , v t v t , vt 2 ψ ( v t 2 ) ψ( v 0 2 )ψ( v t 2 ) 2 v t 2 2 dt. Since f (ν, g 1 ) = f (0, g 1 ) + ν 0 Eg 2 [Ξ 1 (t) -Ξ 2 (t) ] dt, the claim follows. Lemma E.27. 
Let $\mu$ denote the distribution of a $N(0, 2/n)$ random variable, and let $\rho$ denote its density. Let $u \in \mathbb{R}$ with $u \neq 0$, and let $f : [0, \pi] \times \mathbb{R} \to \mathbb{R}$ satisfy:
1. $f$ is continuous in its second argument with its first argument fixed;
2. $f$ is absolutely continuous in its first argument with its second argument fixed, with a.e. derivative $f'$;
3. $f$ and $f'$ are absolutely integrable with respect to the product of Lebesgue measure and $\mu$ over $[0, \pi] \times \mathbb{R}$;
4. There exist constants $p \geq 1$ and $C > 0$ independent of $x$ such that $|f(\nu, x)| \leq C|x|^p$ whenever $|x| \geq 1$.
Consider the function $q(\nu) = \int_{\mathbb{R}} f(\nu, x)\, \sigma(u \cos \nu + x \sin \nu)\, d\mu(x)$. Then $q$ is absolutely continuous, and the following first-order Taylor expansion holds:
$$q(\nu) = q(0) + \int_0^\nu \left( -u\, \frac{f(t, -u \cot t)\, \rho(-u \cot t)}{\sin^2 t} + \int_{\mathbb{R}} f'(t, x)\, \sigma(u \cos t + x \sin t)\, d\mu(x) \right) dt.$$
Proof. For $m \in \mathbb{N}$, define
$$\sigma_m(x) = \begin{cases} 0 & x \leq 0 \\ mx & 0 \leq x \leq m^{-1} \\ 1 & x \geq m^{-1}. \end{cases}$$
Then $0 \leq \sigma_m \leq 1$; $\sigma_m$ is continuous, hence Borel measurable; $\sigma_m \to \sigma$ pointwise as $m \to \infty$; and $\sigma_m$ is differentiable on $\mathbb{R}$ except at $x \in \{0, m^{-1}\}$, with derivative $\sigma_m'(x) = m \mathbf{1}_{0 \leq x \leq m^{-1}}$. Moreover, we have $\int_{\mathbb{R}} m \mathbf{1}_{0 \leq x \leq m^{-1}}\, dx = 1$, and the first-order Taylor expansion $\sigma_m(x) = \int_0^x m \mathbf{1}_{0 \leq x' \leq m^{-1}}\, dx'$. Define $q_m(\nu) = \int_{\mathbb{R}} f(\nu, x)\, \sigma_m(u \cos \nu + x \sin \nu)\, d\mu(x)$. Then at every $\nu \in [0, \pi]$, we have by assumption
$$\int_{\mathbb{R}} |f(\nu, x)\, \sigma_m(u \cos \nu + x \sin \nu)|\, d\mu(x) \leq \int_{\mathbb{R}} |f(\nu, x)|\, d\mu(x) < +\infty,$$
so that the dominated convergence theorem implies $\lim_{m \to \infty} q_m(\nu) = q(\nu)$. By the chain rule, the expression $\sigma_m(x) = -\max\{-m \max\{x, 0\}, -1\}$, and Lemma E.21, $\nu \mapsto \sigma_m(u \cos \nu + x \sin \nu)$ is an absolutely continuous function of $\nu \in [0, \pi]$, and we therefore have, by the product rule for AC functions on an interval (Cohn, 2013, Corollary 6.3.9),
$$q_m(\nu) = q_m(0) + \int_{\mathbb{R}} d\mu(x) \int_0^\nu dt\, \Big[ f'(t, x)\, \sigma_m(u \cos t + x \sin t) + m f(t, x)(x \cos t - u \sin t)\, \mathbf{1}_{0 \leq u \cos t + x \sin t \leq m^{-1}} \Big].$$
We have
$$\int_{\mathbb{R}} \int_0^\pi |f'(t, x)\, \sigma_m(u \cos t + x \sin t)|\, dt\, d\mu(x) \leq \int_{\mathbb{R}} \int_0^\pi |f'(t, x)|\, dt\, d\mu(x) < +\infty$$
by assumption, and
$$\left| \int_{\mathbb{R}} f(t, x)(x \cos t - u \sin t)\, \mathbf{1}_{0 \leq u \cos t + x \sin t \leq m^{-1}}\, d\mu(x) \right| \leq \left( \int_{\mathbb{R}} f(t, x)^2\, d\mu(x) \right)^{1/2} \left( \int_{\mathbb{R}} (x \cos t - u \sin t)^2\, d\mu(x) \right)^{1/2} \leq C_f \left( |u| + \left( \int_{\mathbb{R}} x^2\, d\mu(x) \right)^{1/2} \right) < +\infty,$$
by the growth assumption on $f$ and the Schwarz inequality. Applying compactness of $[0, \pi]$ and the lack of $\nu$ dependence in the final inequality above, an application of Fubini's theorem therefore yields
$$q_m(\nu) = q_m(0) + \int_0^\nu \int_{\mathbb{R}} d\mu(x)\, dt\, \Big[ f'(t, x)\, \sigma_m(u \cos t + x \sin t) + m f(t, x)(x \cos t - u \sin t)\, \mathbf{1}_{0 \leq u \cos t + x \sin t \leq m^{-1}} \Big].$$
By dominated convergence and the first of the preceding two product integrability checks, it is clear that
$$\lim_{m \to \infty} \int_0^\nu \int_{\mathbb{R}} f'(t, x)\, \sigma_m(u \cos t + x \sin t)\, d\mu(x)\, dt = \int_0^\nu \int_{\mathbb{R}} f'(t, x)\, \sigma(u \cos t + x \sin t)\, d\mu(x)\, dt.$$
For the second term, we need to proceed more carefully. For $k \in \mathbb{N}$ sufficiently large for the integral to be over a nonempty interval, we consider
$$q_{m,k}(\nu) := \int_{k^{-1}}^{\nu - k^{-1}} dt \int_{\mathbb{R}} m f(t, x)(x \cos t - u \sin t)\, \frac{1}{\sqrt{2\pi c^2}} e^{-\frac{x^2}{2c^2}}\, \mathbf{1}_{0 \leq u \cos t + x \sin t \leq m^{-1}}\, dx,$$
which is a truncated version of the integral constituting the second term in $q_m$, with a change of variables applied to explicitly show the density corresponding to $\mu$, and where we write $c^2 = 2/n$. In particular, by the calculation used to apply Fubini's theorem in this context previously, we have by dominated convergence
$$\lim_{k \to \infty} q_{m,k}(\nu) = \int_0^\nu \int_{\mathbb{R}} m f(t, x)(x \cos t - u \sin t)\, \mathbf{1}_{0 \leq u \cos t + x \sin t \leq m^{-1}}\, d\mu(x)\, dt.$$
By the product integrability assumption on $f$ and Fubini's theorem, we can consider the inner $\mathbb{R}$-integral for fixed $t$, and due to our truncation we have $0 < t < \pi$; we therefore change variables $x \to x \sin^{-1} t$ in the inner integral to get
$$q_{m,k}(\nu) = \int_{k^{-1}}^{\nu - k^{-1}} dt \int_{\mathbb{R}} m f\left(t, \frac{x}{\sin t}\right) \left( \frac{x \cos t}{\sin^2 t} - u \right) \frac{1}{\sqrt{2\pi c^2}} e^{-\frac{x^2}{2 c^2 \sin^2 t}}\, \mathbf{1}_{0 \leq u \cos t + x \leq m^{-1}}\, dx.$$
If $0 < t < \pi$ and $x \in \mathbb{R}$, define
$$g(t, x) = f\left(t, \frac{x}{\sin t}\right)\left(\frac{x \cos t}{\sin^2 t} - u\right) \frac{1}{\sqrt{2\pi c^2}} e^{-\frac{x^2}{2c^2 \sin^2 t}},$$
so that, after an additional change of variables $x \to x - u \cos t$, we obtain
$$q_{m,k}(\nu) = m \int_{k^{-1}}^{\nu - k^{-1}} dt \int_{\mathbb{R}} g(t, x - u \cos t)\, \mathbf{1}_{0 \leq x \leq m^{-1}}\, dx.$$
Suppose now that $|x| \leq |u|/2$ and $0 \leq t \leq \pi/4$, so that $|\cos t| \geq 2^{-1/2}$; then $|x - u \cos t| \geq |u|(2^{-1/2} - 2^{-1}) \geq K|u|$, where we can take $K = 2^{-1/2} - 2^{-1} > 0$. Applying the triangle inequality and the condition on $|x|$ gives
$$|g(t, x - u \cos t)| \leq \frac{C (3/2)^{p+1} |u|^{p+1}}{\sin^{p+2} t} \exp\left(-\frac{K^2 u^2}{2 c^2 \sin^2 t}\right),$$
which depends only on $t$. For any constants $c', C' > 0$, the continuous map $y \mapsto C' |y|^{p+2} e^{-c' y^2}$ is a bounded function of $y \in \mathbb{R}$, by L'Hôpital's rule applied to determine $\lim_{y \to \pm\infty} |y|^p e^{-y^2} = 0$ for any $p > 0$. It follows that there is a constant $M \geq 0$ depending only on $c, u, p$ such that $|g(t, x - u \cos t)| \leq M$ whenever $0 < t \leq \pi/4$; we obtain the result at $t = 0$ by the previous limit calculation, and the case $3\pi/4 \leq t \leq \pi$ follows by symmetry. On the remaining range $\pi/4 \leq t \leq 3\pi/4$, we have $\sin t \geq 2^{-1/2}$, which gives a second bound of the same kind; taking the sum of our two bounds then yields $|g(t, x - u \cos t)| \leq M$ for $M \geq 0$ not depending on $k, m$, whenever $(t, x) \in [0, \pi] \times [-u/2, u/2]$.

Now, after one additional change of variables $x \to x m^{-1}$ (whose Jacobian cancels the factor of $m$), we have
$$q_{m,k}(\nu) = \int_{k^{-1}}^{\nu - k^{-1}} dt \int_{\mathbb{R}} g\left(t, x m^{-1} - u \cos t\right) \mathbf{1}_{0 \leq x \leq 1}\, dx.$$
We can invoke our $M$ bound when $|x| m^{-1} \leq |u|/2$, and the indicator enforces $|x| \leq 1$; thus, taking $m \geq 2/|u|$ (here we use $|u| > 0$ critically) implies
$$\left| \int_{k^{-1}}^{\nu - k^{-1}} dt \int_{\mathbb{R}} g\left(t, x m^{-1} - u \cos t\right) \mathbf{1}_{0 \leq x \leq 1}\, dx \right| \leq M \int_{k^{-1}}^{\nu - k^{-1}} dt < +\infty,$$
so that by dominated convergence, we have
$$\lim_{k \to \infty} q_{m,k}(\nu) = \int_0^\nu dt \int_{\mathbb{R}} g\left(t, x m^{-1} - u \cos t\right) \mathbf{1}_{0 \leq x \leq 1}\, dx.$$
By the same estimate together with second-argument continuity of $f$, hence of $g$, we have by the dominated convergence theorem
$$\lim_{m \to \infty} \lim_{k \to \infty} q_{m,k}(\nu) = \int_0^\nu g(t, -u \cos t)\, dt = -u \int_0^\nu \frac{f(t, -u \cot t)}{\sin^2 t} \frac{1}{\sqrt{2\pi c^2}} e^{-\frac{u^2 \cot^2 t}{2 c^2}}\, dt.$$
Combining with our results on $q_m$ and the first term, we conclude
$$q(\nu) = q(0) + \int_0^\nu dt \left( -u\, \frac{f(t, -u \cot t)}{\sin^2 t} \frac{1}{\sqrt{2\pi c^2}} e^{-\frac{u^2 \cot^2 t}{2 c^2}} + \int_{\mathbb{R}} f'(t, x)\, \sigma(u \cos t + x \sin t)\, d\mu(x) \right),$$
as claimed.
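The expansion just established lends itself to a direct numerical sanity check. The sketch below assumes that $\sigma$ in the statement of Lemma E.27 denotes the step function $\mathbf{1}_{x > 0}$ (the pointwise limit of the mollifiers $\sigma_m$ used in the proof); the test function $f$, the values of $n$, $u$, $\nu$, and all grids and tolerances are illustrative choices, not quantities from the paper.

```python
import numpy as np

def trapezoid(y, x):
    # Plain trapezoid rule (avoids np.trapz / np.trapezoid naming differences).
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

n, u = 2, 0.7                         # mu = N(0, c^2) with c^2 = 2/n, as in the lemma
c2 = 2.0 / n
xs = np.linspace(-8.0, 8.0, 8001)
rho = np.exp(-xs ** 2 / (2 * c2)) / np.sqrt(2 * np.pi * c2)

# Smooth test function with polynomial growth (p = 2) and its nu-derivative.
def f(nu, x):
    return np.cos(nu) * x ** 2

def fprime(nu, x):
    return -np.sin(nu) * x ** 2

def q(nu):
    # q(nu) = E_mu[ f(nu, x) 1{u cos nu + x sin nu > 0} ], computed by quadrature.
    step = (u * np.cos(nu) + xs * np.sin(nu) > 0).astype(float)
    return trapezoid(f(nu, xs) * step * rho, xs)

def taylor_rhs(nu, n_t=2000):
    # Right-hand side of the lemma: q(0) plus the integral of the regular
    # term and the Dirac term located at x = -u cot t.
    ts = np.linspace(1e-4, nu, n_t)
    vals = np.empty(n_t)
    for i, t in enumerate(ts):
        step = (u * np.cos(t) + xs * np.sin(t) > 0).astype(float)
        regular = trapezoid(fprime(t, xs) * step * rho, xs)
        xstar = -u / np.tan(t)
        dirac = (-u * f(t, xstar) / np.sin(t) ** 2
                 * np.exp(-xstar ** 2 / (2 * c2)) / np.sqrt(2 * np.pi * c2))
        vals[i] = regular + dirac
    return q(0.0) + trapezoid(vals, ts)

print(abs(q(2.0) - taylor_rhs(2.0)))  # small: quadrature error only
```

The two sides agree up to discretization error of the quadrature; without the Dirac correction term the mismatch is order one.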

E.3.6 MISCELLANEOUS ANALYTICAL RESULTS

Lemma E.28. If m > 0, then φ is 1-Lipschitz. Proof. We recall φ(ν) = E g1,g2∼ i.i.d. N (0,(2/n)I) cos -1 X ν . Considering instead the related function φ defined by φ(ν) = E g1,g2∼ i.i.d. N (0,(2/n)I) [1 E1 φ(ν, g 1 , g 2 )], where φ(ν, g 1 , g 2 ) = cos -1 v 0 , v ν v 0 2 v ν 2 , we notice φ(ν) = φ(ν) + (π/2)µ(E c 1 ). It is therefore equivalent to show that φ is 1-Lipschitz; but this follows from Lemma E.22. Lemma E.29 (Even Moments). If k ∈ N and k ≤ n, one has E v ν 2k 2 -1 ≤ C k n -1 , E vν 2k 2 -1 ≤ C k n -1 , where C k ≤ (k -1) 2 4 k-1 (2k -1)!!. Proof. First notice that the claim is immediate if k = 1, since E v ν 2 2 = 1. We therefore proceed assuming k > 1. Also notice that Lemmas G.11 and E.17 show that vν and v ν have matching even moments, so it suffices to prove the claim for v ν . By rotational invariance, we can write E v ν 2k 2 = 2 k n k E gi∼N (0,1)   n i=1 σ(g i ) 2 k   = 2 k n k 1≤i1,...,i k ≤n E   k j=1 σ(g ij ) 2   , where the last sum is taken over all elements of [n] k . We split this sum into a sum over terms whose expectations contain no repeated indices, and a sum over all other terms. There are exactly k! n k ways to choose a k-multi-index from an alphabet of size n without repetitions-select the k distinct indices, then arrange them in every possible way-and multi-indices without repetitions correspond to terms in the sum where the expectation factors completely, by independence, so we can write E v ν 2k 2 = 2 k n k     k! n k E σ(g 1 ) 2 k + 1≤i1,...,i k ≤n only repeated indices E   k j=1 σ(g ij ) 2       . We will prove the elementary estimate n k -k! n k ≤ (k -1) 2 n k-1 2 k-2 . (E.47) Assuming it for the time being, we use that E σ(g 1 ) 2 k = 2 -k to conclude (2/n) k k! n k E σ(g 1 ) 2 k -1 ≤ (k -1) 2 2 k-2 n -1 . Next we study the expectation-of-products arising in the sum. The expectation factors over distinct indices; we can classify repeated indices in a multi-index by partitions j 1 +. . . 
j m = k, where each j l is a positive integer. Formally, for each multi-index (i 1 , . . . , i k ), there is a partition j 1 + . . . j m = k such that E   k j=1 σ(g ij ) 2   = m l=1 E σ(g i p(l) ) 2j l , where p : [m] → [k] is injective. We can evaluate these expectations using the result E σ(g 1 ) 2k = 1 2 (2k -1)!!, because the coordinates of g are i.i.d.: m l=1 E σ(g i p(l) ) 2j l = 1 2 m m i=1 (2j i -1)!!. We claim that 1 2 m m i=1 (2j i -1)!! ≤ 1 2 (2k -1)!!, (E.48) which is the expectation obtained from a term with all indices equal, whence 2 k n k 1≤i1,...,i k ≤n only repeated indices E   k j=1 σ(g ij ) 2   ≤ 2 k n k n k-1 (k -1) 2 2 k-2 E σ(g 1 ) 2k = (k -1) 2 2 2k-3 (2k -1)!! n -1 by (E.47), which gives a bound on the number of terms in the sum. Noticing that this constant is larger than (k -1) 2 2 k-2 , we can conclude the claimed estimate on C k provided we can justify (E.48). For this, it suffices to show 1 ≤ 2 m-1 (2k -1)!! m i=1 (2j i -1)!! . Observe that m ≥ 1 for any partition, so 2 m-1 ≥ 1 and we need only study the second term on the righthand side. We write this term as If j 1 = k, then this product is empty and m = 1, so the claim is established. If not, then we proceed to the next group of factors in the denominator: we get (2k -1)!! m i=1 (2j i -1)!! = k i=1 (2i -1) k i=j1+1 (2i -1) j2 i=1 (2i -1) ≥ 1, because j 1 > 0 implies that every term in the numerator (ordered in ascending order) is larger than the corresponding term in the denominator. This gives the claim in the case m = 2; for m > 2, we conclude the claim by induction. To close the loop, we prove (E.47). Using simple algebra, we observe n k -k! n k = n k -n(n -1) . . . (n -k + 1) = n k   1 - k-1 j=1 1 - j n   , and we note bounds 1 - k -1 n k-1 ≤ k-1 j=1 1 - j n ≤ 1. 
Working on the upper bound first, we obtain with the help of the binomial theorem 1 - k-1 j=1 1 - j n ≤ 1 -1 - k -1 n k-1 = k-1 j=1 k -1 j (-1) j+1 k -1 n j ≤ k -1 n k-2 j=0 k -1 j + 1 k -1 n j , where the last expression removes cancellation by making each term in the sum nonnegative, then applies a change of index. With the identity k-1 j+1 = (k -1)/(j + 1) k-2 j , we proceed as k -1 n k-2 j=0 k -1 j + 1 k -1 n j = (k -1) 2 n k-2 j=0 k -2 j 1 j + 1 k -1 n j ≤ (k -1) 2 n k-2 j=0 k -2 j k -1 n j = (k -1) 2 n 1 + k -1 n k-2 , given that 1/(j + 1) ≤ 1. Since n ≥ k, this gives n k -k! n k ≤ (k -1) 2 n k-1 2 k-2 . The upper bound on the product gives immediately n k -k! n k ≥ 0, which completes the proof. Lemma E.30 (Mixed Moments). Let g 1 , . . . , g n denote the n (i.i.d. according to N (0, (2/n)I 2 )) rows of the matrix G. Let k ∈ [n], and for each 1 ≤ j ≤ k let f j : R 2 → R be a function such that 1. E[|f j (g 1 )| p ] 1/p ≤ Cn -1 p, with C > 0 an absolute constant and p ≥ 1; 2. E |f j (g 1 )| ≤ n -1 . Consider the quantities A = E   k j=1 n i=1 f j (g i )   ; B = n k k j=1 E f j (g 1 ) . Then one has |A -B| ≤ Cn -1 , with the constant depending only on k.
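The counting estimate (E.47) and the half-moment identity $\mathbb{E}[\sigma(g)^{2k}] = (2k-1)!!/2$ used in the preceding proof can both be checked mechanically; the ranges of $k$ and $n$ and the quadrature grid below are arbitrary illustrations.

```python
import numpy as np
from math import comb, factorial

def trapezoid(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Exact check of (E.47): n^k - k! C(n, k) <= (k-1)^2 n^(k-1) 2^(k-2) for n >= k.
# The left-hand side counts multi-indices in [n]^k with a repeated index.
for k in range(2, 9):
    for n in range(k, 200):
        lhs = n ** k - factorial(k) * comb(n, k)
        rhs = (k - 1) ** 2 * n ** (k - 1) * 2 ** (k - 2)
        assert 0 <= lhs <= rhs, (n, k)

# Quadrature check of E[sigma(g)^{2k}] = (2k-1)!!/2 for g ~ N(0, 1), sigma = ReLU:
# only the positive half-line contributes, giving half the Gaussian moment (2k-1)!!.
xs = np.linspace(0.0, 14.0, 400001)
phi = np.exp(-xs ** 2 / 2) / np.sqrt(2 * np.pi)
for k in range(1, 5):
    dfact = float(np.prod(np.arange(2 * k - 1, 0, -2)))   # (2k-1)!!
    val = trapezoid(xs ** (2 * k) * phi, xs)
    assert abs(val - dfact / 2) < 1e-5 * dfact

print("E.47 and the half-moment identity verified")
```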

Proof. Start by writing

A = 1≤i1,...,i k ≤n E   k j=1 f j (g ij )   = k! n k k j=1 E f j (g 1 ) + 1≤i1,...,i k ≤n only repeated indices E   k j=1 f j (g ij )   = n -k k! n k B + 1≤i1,...,i k ≤n only repeated indices E   k j=1 f j (g ij )   as in Lemma E.29. Applying the triangle inequality and the first moment assumption on the functions f j , we get n -k k! n k B -B = |B| k! n k n -k -1 ≤ (k -1) 2 2 k-2 n -1 , with the last inequality following from the estimate (E.47). For the remaining term, we have by the triangle inequality 1≤i1,...,i k ≤n only repeated indices E   k j=1 f j (g ij )   ≤ n k -k! n k sup (i1,...,i k )⊂[n] k E   k j=1 f j (g ij )   ≤ (k -1) 2 n k-1 2 k-2 sup (i1,...,i k )⊂[n] k E   k j=1 f j (g ij )   , using again (E.47) to control the number of terms in the sum. To control the supremum, we apply the Schwarz inequality k -1 times to get -1) . E   k j=1 f j (g ij )   ≤ E f 1 (g i1 ) 2 1/2 E   k j=2 f j (g ij ) 2   1/2 ≤ . . . ≤   k-1 j=1 E f j (g ij ) 2 j 2 -j   E f k (g i k ) 2 k-1 2 -(k By the subexponential assumption on the functions f j , we have moment growth control, and we therefore have a bound E   k j=1 f j (g ij )   ≤   k-1 j=1 C 1 n -1 2 j   C 1 n -1 2 k-1 = C k 1 n -k 2 (k-1)+ k-1 j=1 j = C k 1 n -k 2 1 2 (k-1)(k+2) , and consequently 1≤i1,...,i k ≤n only repeated indices E   k j=1 f j (g ij )   ≤ C k 1 (k -1) 2 2 1 2 k(k+3) n -1 , which proves the claim. Lemma E.31. For any 0 < c ≤ 1 2 , there exists a smooth function ψ c : R → R satisfying 1. ψ c (x) = x if x ≥ 2c and ψ c (x) = c if x ≤ c, and ψ c is between c and 2c if c ≤ x ≤ 2c; 2. ψ c (x) ≥ 1 2 x; 3. There are constants M 1 , M 2 > 0 depending only on c such that |ψ c | ≤ M 1 and |ψ c | ≤ M 2 . Proof. The function f (x) = 1 x>0 e -1 x is smooth on R, and satisfies 0 ≤ f ≤ 1 and f = 0 if x ≤ 0. The function φ c (x) = f (x) f (x) + f (c -x) is therefore smooth, satisfies 0 ≤ φ c ≤ 1, and satisfies φ c (x) = 0 if x ≤ 0 and φ c (x) = 1 if x ≥ c. 
Simplifying using the definitions, we can write φ c (x) =        0 x ≤ 0 1 1+exp c-2x x(c-x) 0 < x < c 1 x ≥ c. It follows that x → xφ c (x) is zero when x ≤ 0, x when x ≥ c, and in between otherwise. Thus, the function ψ c (x) = c + (x -c)φ c (x -c) satisfies property 1. For property 2, we note that ψ c (x) = c + (x -c)φ c (x -c) implies that ψ c ≥ c, since φ c (x -c) = 0 whenever x ≤ c and φ c ≥ 0. Since ψ c (x) = x when x ≥ 2c, we can then conclude ψ c (x) ≥ 1 2 x, since 1 2 x ≤ c when x ≤ 2c and 1 2 x ≤ x when x ≥ 2c. For property 3, we note that by property 1, ψ c (x) = 1 if x ≥ 2c and ψ c (x) = 0 if x ≤ 0; consequently ψ c (x) = 0 if x ∈ [0, 2c], and it suffices to control ψ c and ψ c in this region. By translation equivariance of the derivative, it then suffices to control the derivatives of h(x) = xφ c (x) for 0 < x < c. We calculate h (x) = xφ c (x) + φ c (x), h (x) = xφ c (x) + 2φ c (x), (E.49) and φ c (x) = f (c -x)f (x) -f (x)f (c -x) (f (x) + f (c -x)) 2 , (E.50) φ c (x) = (f (x) + f (c -x)) (f (c -x)f (x) + f (x)f (c -x) -2f (x)f (c -x)) (f (x) + f (c -x)) 3 -2 (f (x) -f (c -x))(f (c -x)f (x) -f (x)f (c -x)) (f (x) + f (c -x)) 3 . (E.51) Completely ignoring possible cancellation, we see that it suffices to get a lower bound on f (x) + f (c -x) and upper bounds on f and f to bound |h | and |h |. We calculate f (x) + f (c -x) = 1 x 2 e -1 x 1 x>0 - 1 (c -x) 2 e - 1 c-x 1 x<c , and since f (x) > 0 if x > 0 and c > 0, we see that any solution of f (x) -f (c -x) = 0 must occur for x ∈ (0, c), which implies as well c -x ∈ (0, c). Writing g(x) = x 2 e -x and using c -1 < x -1 < ∞ for x ∈ (0, c), we note from our previous work that f (x) -f (c -x) = 0 ⇐⇒ g(x -1 ) = g((c -x) -1 ). We calculate g (x) = xe -x (2 -x), so that if x > 2 then g (x) < 0, which implies that g is injective on (2, ∞). 
By assumption, we have $c^{-1} > 2$; consequently there is at most one solution to $f'(x) - f'(c - x) = 0$ in $0 < x < c$, and given that $x = \tfrac{1}{2}c$ is a solution, there is exactly one solution. We check $2f(c/2) < f(0) + f(c) \iff \log 2 < 1/c$, where the first RHS is the value of $f(x) + f(c - x)$ at both $x = 0$ and $x = c$, and since $1/c \geq 2$, we conclude that $f(x) + f(c - x) \geq 2f(c/2) > 0$. Next, we use
$$f'(x) = \frac{1}{x^2} e^{-\frac{1}{x}} \mathbf{1}_{x > 0}, \qquad f''(x) = \left( \frac{1}{x^4} - \frac{2}{x^3} \right) e^{-\frac{1}{x}} \mathbf{1}_{x > 0},$$
together with the bound $x^p e^{-x} \leq p^p e^{-p}$ for $p > 0$, which is proved by differentiating $x \mapsto x^p e^{-x}$, equating to zero, and comparing the values of the function at $x = 0$, $x = p$, and $x \to \infty$, to obtain with the triangle inequality $|f'(x)| \leq 4/e^2$ and $|f''(x)| \leq 256/e^4 + 54/e^3$. Combining these bounds with the lower bound on $f(x) + f(c - x)$ in the expressions (E.49)-(E.51) controls $|h'|$ and $|h''|$, which establishes property 3.

Lemma E.32. Let $Z, \tilde{Z} \in L^2$ be square-integrable random variables. Suppose that $|\tilde{Z}| \leq C$ a.s. and $\|Z - \tilde{Z}\|_{L^2} \leq M$. Then $\mathrm{Var}[Z] \leq \mathrm{Var}[\tilde{Z}] + CM + M^2$.

Proof. This is a simple consequence of the triangle inequality and the centering inequality for the $L^2$ norm. We have
$$\|Z - \mathbb{E}[Z]\|_{L^2} \leq \|Z - \tilde{Z} - \mathbb{E}[Z - \tilde{Z}]\|_{L^2} + \|\tilde{Z} - \mathbb{E}[\tilde{Z}]\|_{L^2},$$
and additionally $\|Z - \tilde{Z} - \mathbb{E}[Z - \tilde{Z}]\|_{L^2} \leq \|Z - \tilde{Z}\|_{L^2} \leq M$, so that, after squaring, we get
$$\mathrm{Var}[Z] \leq \mathrm{Var}[\tilde{Z}] + M \|\tilde{Z} - \mathbb{E}[\tilde{Z}]\|_{L^2} + M^2 \leq \mathrm{Var}[\tilde{Z}] + M \|\tilde{Z}\|_{L^2} + M^2 \leq \mathrm{Var}[\tilde{Z}] + CM + M^2,$$
by centering and the a.s. boundedness assumption.

Lemma E.33. Let $X, Y$ be square-integrable random variables, and let $d > 0$. Suppose $|X| \leq M_1$ a.s., and suppose $\mathbb{P}[|Y - 1| \geq C\sqrt{d/n}] \leq C' e^{-cd}$ and $\|Y - 1\|_{L^2} \leq M_2$. Then one has, with probability at least $1 - C' e^{-cd}$,
$$|XY - \mathbb{E}[XY]| \leq |X - \mathbb{E}[X]| + 2 C M_1 \sqrt{\frac{d}{n}} + \sqrt{C'}\, M_1 M_2\, e^{-cd/2}.$$
Proof. We apply the triangle inequality:
$$|XY - \mathbb{E}[XY]| \leq |XY - X| + |X - \mathbb{E}[X]| + |\mathbb{E}[X] - \mathbb{E}[XY]| \leq M_1 |Y - 1| + M_1 \mathbb{E}[|Y - 1|] + |X - \mathbb{E}[X]|,$$
where the second inequality also applies Jensen's inequality. We have
$$\mathbb{E}[|Y - 1|] = \mathbb{E}\left[\left(\mathbf{1}_{|Y - 1| \geq C\sqrt{d/n}} + \mathbf{1}_{|Y - 1| < C\sqrt{d/n}}\right)|Y - 1|\right] \leq C\sqrt{\frac{d}{n}} + \mathbb{E}\left[\mathbf{1}_{|Y - 1| \geq C\sqrt{d/n}}\right]^{1/2} \mathbb{E}\left[(Y - 1)^2\right]^{1/2} \leq C\sqrt{\frac{d}{n}} + \sqrt{C'}\, e^{-cd/2}\, M_2,$$
and on the event $\{|Y - 1| < C\sqrt{d/n}\}$ the first term is at most $C M_1 \sqrt{d/n}$; combining the three bounds gives the claim.

Proof.
We start from the formula
$$\mathrm{Var}\left[\sum_{i=1}^k X_i\right] = \sum_{i=1}^k \mathrm{Var}[X_i] + 2 \sum_{i < j} \mathrm{cov}[X_i, X_j], \qquad \mathrm{cov}[X_i, X_j] = \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\mathbb{E}[X_j] = \mathbb{E}[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])];$$
one establishes this formula by distributing in the definition of the variance. By assumption, there are events $E_i$ on which $|X_i - \mathbb{E}[X_i]| \leq N_i$ and such that $\mathbb{P}[E_i] \geq 1 - \delta_i$. Partitioning the expectation, we therefore have
$$\mathrm{Var}[X_i] = \mathbb{E}[(X_i - \mathbb{E}[X_i])^2] \leq N_i^2 + \mathbb{E}[\mathbf{1}_{E_i^c}(X_i - \mathbb{E}[X_i])^2] \leq N_i^2 + \mathbb{E}[\mathbf{1}_{E_i^c}]^{1/2}\, \mathbb{E}[(X_i - \mathbb{E}[X_i])^4]^{1/2} \leq N_i^2 + \sqrt{\delta_i}\, M_i^2,$$
where the first inequality uses nonnegativity of the integrand to discard the indicator after applying the deviations bound, the second applies the Schwarz inequality, and the third uses fourth moment control. For the covariance terms, we apply Jensen's inequality to obtain $|\mathrm{cov}[X_i, X_j]| = |\mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\mathbb{E}[X_j]| \leq \mathbb{E}[|X_i - \mathbb{E}[X_i]||X_j - \mathbb{E}[X_j]|]$, so that, again partitioning the outermost expectation and applying our assumptions, we get
$$|\mathrm{cov}[X_i, X_j]| \leq N_i N_j + \mathbb{E}\left[\mathbf{1}_{E_i^c \cup E_j^c} |X_i - \mathbb{E}[X_i]||X_j - \mathbb{E}[X_j]|\right] \leq N_i N_j + \mathbb{E}[\mathbf{1}_{E_i^c \cup E_j^c}]^{1/2}\, \mathbb{E}[(X_i - \mathbb{E}[X_i])^4]^{1/4}\, \mathbb{E}[(X_j - \mathbb{E}[X_j])^4]^{1/4} \leq N_i N_j + \sqrt{\delta_i + \delta_j}\, M_i M_j,$$
where in the first inequality we again use nonnegativity of the integrand to discard the indicator after applying the deviations bound, in the second we apply the Schwarz inequality twice, and in the third we use a union bound to control the indicator. Since $\delta_i + \delta_j \leq 2\max\{\delta_i, \delta_j\}$, we conclude the claimed expression.

Lemma E.36. If $C > 0$ and $p > 0$, the function $g(t) = t^p e^{-Ct^2}$ for $t \geq 0$ satisfies the bound $g(t) \leq (p/(2Ce))^{p/2}$.

Proof. The function $g$ is smooth and has derivatives
$$g'(t) = t^{p-1} e^{-Ct^2}\left(p - 2Ct^2\right), \qquad g''(t) = t^{p-2} e^{-Ct^2}\left(p(p-1) - 2(2p+1)Ct^2 + 4C^2 t^4\right).$$
It therefore has at most two critical points, one possibly at $t = 0$ and one at $t = \sqrt{p/(2C)}$, and these points are distinct when $p > 0$ and $C > 0$.
We check the sign of g at the second critical point; since p/(2C) > 0 we need only check the value of (p(p -1) -2(4p -1)Ct 2 + 4C 2 t 4 ) evaluated at t = p/(2C), which is -2p 2 < 0. Then since lim t→±∞ g(t) = 0 and g(0) = 0, we conclude that g(t) ≤ g( p/(2C)), which gives the claimed bound. Lemma E.37. Following Lemma E.26, consider the random variables Ξ 1 (s, g 1 , g 2 ) = n i=1 σ(g 1i ) 3 ρ(-g 1i cot s) ψ( v 0 2 )ψ( v i s 2 ) sin 3 s Ξ 2 (ν, g 1 , g 2 ) = v 0 , v s ψ ( v s 2 ) v s 2 ψ( v 0 2 )ψ( v s 2 ) 2 - v 0 , v s ψ( v 0 2 )ψ( v s 2 ) Ξ 3 (ν, g 1 , g 2 ) = - v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 2 Ξ 4 (ν, g 1 , g 2 ) = -2 v 0 , vs v s , vs ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 Ξ 5 (ν, g 1 , g 2 ) = - v 0 , v s vs 2 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 Ξ 6 (ν, g 1 , g 2 ) = 2 v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 3 v s 2 2 + v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 3 2 , where Ξ 1 ( • , g 1 , g 2 ) is defined at {0, π} by continuity (following the proof of Lemma E.27, it is 0 here). Then for each i = 1, . . . , 6, one has: 1. For each i, there is a µ⊗µ-integrable function f i : R n ×R n → R such that |Ξ i ( • , g 1 , g 2 )- E[Ξ i ( • , g 1 , g 2 )]| ≤ f i (g 1 , g 2 ); 2. There is an absolute constant C i > 0 such that for every 0 ≤ ν ≤ π, one has f i L 4 ≤ C i , so that in particular Ξ i (ν, • , • ) -E[Ξ i (ν, • , • )] L 4 ≤ C i . Proof. First we reduce to noncentered fourth moment calculations. If X is a random variable with finite fourth moment, we have by Minkowski's inequality X -E[X] L 4 ≤ X L 4 + |E[X]|, so that the triangle inequality for the expectation and the Lyapunov inequality imply X -E[X] L 4 ≤ 2 X L 4 . We can therefore control the noncentered fourth moments of the random variables Ξ i and pay only an extra factor of 2 in controlling the centered moments. 
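The bound of Lemma E.36, which is invoked repeatedly in the $\Xi_1$ estimates below, can be sanity-checked on a grid; the $(p, C)$ pairs are arbitrary illustrations.

```python
import numpy as np

# Check max_{t >= 0} t^p e^{-C t^2} = (p/(2Ce))^{p/2}, attained at t = sqrt(p/(2C)).
ts = np.linspace(0.0, 20.0, 1_000_001)
for p in (0.5, 1.0, 2.0, 3.0, 7.0):
    for C in (0.1, 1.0, 4.0):
        g = ts ** p * np.exp(-C * ts ** 2)
        bound = (p / (2 * C * np.e)) ** (p / 2)
        assert g.max() <= bound * (1 + 1e-12)       # the bound holds...
        assert abs(g.max() - bound) < 1e-6 * bound  # ...and is (essentially) attained
print("Lemma E.36 bound verified on a grid")
```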
For the proofs of property 1, we have similarly |X -E[X]| ≤ |X| + E[|X|] from the triangle inequality, so that it again suffices to prove property 1 for the noncentered random variables |Ξ i |. Ξ 1 control. If ν = 0 or ν = π, the integrand is identically zero; we proceed assuming 0 < ν < π. Using ψ ≥ 1 4 , we have 0 ≤ Ξ 1 (ν, g 1 , g 2 ) ≤ 16 n i=1 σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν . For property 1, by elementary properties of cos we have for 0 ≤ ν ≤ π/4 and 3π/4 ≤ ν ≤ π that cos 2 ν ≥ 1 2 , so ρ(-g 1i cot ν) ≤ n 4π e - ng 2 1i 8 sin 2 ν . This gives σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν ≤ |g 1i | 3 ρ(-g 1i cot ν) sin 3 ν = 2 π K 1/2 g 1i sin ν 3 e -K g1i sin ν 2 , where we define K = n/8. By Lemma E.36, we have that g ≤ g( 3/2K) = CK -3/2 , where C > 0 is an absolute constant. We conclude σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν ≤ C/n, (E.52) provided ν is not in [π/4, 3π/4]. On the other hand, if π/4 ≤ ν ≤ 3π/4, we have sin ν ≥ 1/ √ 2, so that σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν ≤ C √ nσ(g 1i ) 3 , (E.53) where C > 0 is an absolute constant. Since these ν constraints cover [0, π], we have for all ν and all g 1 (by the triangle inequality) |Ξ 1 (ν, g 1 , g 2 )| ≤ C + C n 3/2 σ(g 1i ) 3 , where C, C > 0 are absolute constants, and by Lemma G.11, we have E[C + C n 3/2 σ(g 1i ) 3 ] = C + C , where C > 0 is an absolute constant. This proves property 1 with f 1 (g 1 , g 2 ) = C+C n 3/2 σ(g 1i ) 3 , with different absolute constants, and property 2 follows from Lemma G.11 after applying the Minkowski inequality and calculating the integral, which has the necessary cancellation of the n 3/2 factor. Ξ 2 control. By Lemma E.31, we have |ψ | ≤ C for an absolute constant C > 0 and x/ψ(x) ≤ 2. Cauchy-Schwarz then implies v 0 , v s ψ ( v s 2 ) v s 2 ψ( v 0 2 )ψ( v s 2 ) 2 ≤ 8C. In an exactly analogous manner, we have v 0 , v s ψ( v 0 2 )ψ( v s 2 ) ≤ 4. Both bounds satisfy the requirements of property 1, with f 2 (g 1 , g 2 ) = 16C + 8. 
The triangle inequality and Minkowski's inequality then implies Ξ 1 (ν, • , • ) L 4 ≤ C . Ξ 3 control. By Lemma E.31, we have |ψ | ≤ C for an absolute constant C > 0, ψ ≥ 1 4 , and x/ψ(x) ≤ 2. Cauchy-Schwarz then implies v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 2 ≤ 16C vs 2 2 , and the triangle inequality gives vs 2 2 ≤ g 1 2 2 + g 2 2 2 + 2 g 1 2 g 2 2 , whose expectation is bounded by 4, by the Schwarz inequality and Lemma G.11. We can therefore take f 3 (g 1 , g 2 ) = C + C ( g 1 2 + g 2 2 ) 2 , and we have ( g 1 2 + g 2 2 ) 2 L 4 = g 1 2 + g 2 2 2 L 8 ≤ ( g 1 2 L 8 + g 2 2 L 8 ) 2 ≤ C, where C > 0 is a (new) absolute constant, by the Minkowski inequality and lemmas Lemmas G.10 and G.11. This establishes property 2. Ξ 4 control. By Lemma E.31, we have |ψ | ≤ C for an absolute constant C > 0, ψ ≥ 1 4 , and x/ψ(x) ≤ 2; Cauchy-Schwarz then implies 2 v 0 , vs v s , vs ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 ≤ 64C vs 2 2 . Following the argument for Ξ 3 exactly, we conclude property 1 and 2 from this bound with a suitable modification of the constant. Ξ 5 control. We have v 0 , v s vs 2 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 2 ≤ 32C vs 2 2 , following exactly the setup and instantiations in the argument for Ξ 4 . Following the argument for Ξ 3 exactly, we conclude property 1 and 2 from this bound with a suitable modification of the constant. Ξ 6 control. The triangle inequality gives |Ξ 6 (s, g 1 , g 2 )| ≤ 2 v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 3 v s 2 2 + v 0 , v s v s , vs 2 ψ ( v s 2 ) ψ( v 0 2 )ψ( v s 2 ) 2 v s 3 2 , and following the setup of Ξ 4 and Ξ 5 control gives |Ξ 6 (ν, g 1 , g 2 )| ≤ 128C vν 2 2 + 32C vν 2 2 . Following the argument for Ξ 3 exactly, we conclude property 1 and 2 from this bound with a suitable modification of the constant. Lemma E.38. 
In the notation of Lemma E.13, there are absolute constants c, c , C > 0 and an absolute constant K > 0 such that if n ≥ K , there is an event with probability at least 1 -2e -cn on which one has v 0 2 2 ψ( v 0 2 ) 2 -E v 0 2 2 ψ( v 0 2 ) 2 ≤ Ce -c n . Proof. There is no ν dependence in this term, so we need only prove a single bound. Following the proof of the measure bound in Lemma E.16, but using only the pointwise concentration result, we assert that if n ≥ C an absolute constant there is an event E on which 0.5 ≤ v 0 2 ≤ 2 with probability at least 1 -2e -cn with c > 0 an absolute constant. This implies that if g 1 ∈ E we have v 0 2 2 ψ( v 0 2 ) 2 = 1, which we can use together with nonnegativity of the integrand to calculate E v 0 2 2 ψ( v 0 2 ) 2 = E[1 E ] + E 1 E c v 0 2 ψ( v 0 2 ) 2 ≥ E[1 E ] ≥ 1 -2e -cn , whence v 0 2 2 ψ( v 0 2 ) 2 -E v 0 2 2 ψ( v 0 2 ) 2 ≤ 2e -cn whenever g 1 ∈ E. Similarly, we calculate E v 0 2 2 ψ( v 0 2 ) 2 = E[1 E ] + E 1 E c v 0 2 ψ( v 0 2 ) 2 ≤ 1 + E[1 E c ] 1/2 E v 0 2 ψ( v 0 2 ) 4 1/2 ≤ 1 + 16Ce -cn , applying the Schwarz inequality, property 2 in Lemma E.31, and the measure bound on E, with c , C > 0 absolute constants, whence E v 0 2 2 ψ( v 0 2 ) 2 - v 0 2 2 ψ( v 0 2 ) 2 ≤ 16C e -c n whenever g 1 ∈ E. Worst-casing constants, we conclude v 0 2 2 ψ( v 0 2 ) 2 -E v 0 2 2 ψ( v 0 2 ) 2 ≤ Ce -cn when g 1 ∈ E, which is sufficient for our purposes. Lemma E.39. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, c , c , c , C, C , C , C , C > 0 and absolute constants K, K > 0 such that if n ≥ Kd 4 log 4 n and d ≥ K , there is an event with probability at least 1 -C n -c d/2 -C ne -c n on which one has |Ξ 1 (ν, g 1 , g 2 ) -E[Ξ 1 (ν, g 1 , g 2 )]| ≤ C d log n n + C n -c d + C ne -c n . Proof. If ν ∈ {0, π}, then Ξ 1 (ν, g 1 , g 2 ) = 0 for every (g 1 , g 2 ); we therefore assume 0 < ν < π below. 
We will apply Lemma E.34 to begin, with the instantiations X i = σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν , Y i = 1 ψ( v 0 2 )ψ( v i ν 2 ) , since then Ξ 1 (ν, g 1 , g 2 ) = i X i Y i . We have X i ≥ 0; writing k 2 = 2/n, we calculate E[X i ] = 1 √ 8πk 2 1 √ 2πk 2 R g 3 sin 3 ν exp - 1 2k 2 g 2 sin 2 ν dg = 2 πn sin ν (E.54) where the second line uses the change of variables g → g sin ν and Lemma G.11. Additionally, we have E[X 2 i ] = k 4 4π 1 √ 2π R g 6 sin 6 ν exp - 1 2 g 2 (1 + 2 cot 2 ν) dg = k 4 sin ν 4π 1 √ 2π R g 6 exp - 1 2 g 2 (1 + cos 2 ν) dg = k 4 sin ν 4π(1 + cos 2 ν) 7/2 1 √ 2π R g 6 e -g 2 /2 dg = 15 sin ν πn 2 (1 + cos 2 ν) 7/2 , (E.55) where in the second line we change variables g → g sin ν, in the third line we change variables g → g/ √ 1 + cos 2 ν, and in the fourth line we use Lemma G.11. We can calculate the derivative of the map g(ν) = (1 + cos 2 ν) -7/2 sin ν as g (ν) = cos(ν)(1 + cos 2 ν) -7 [(1 + cos 2 ν) 7/2 + 7 sin 2 (ν)(1 + cos 2 ν) 5/2 ], which evidently has the same sign as cos(ν); so g is strictly increasing below π/2 and strictly decreasing above it, and is therefore maximized at g(π/2). We conclude the bound E[X 2 i ] ≤ 15 πn 2 , (E.56) which shows that i X i L 2 = O(1) . Next, we have Y i ≤ 16 by Lemma E.31, so by the Minkowski inequality Y i -1 L 4 ≤ 17 for each i, and it remains to control deviations. We consider the event E = E 0.5,1 in the notation of Lemma E.16, which has probability at least 1 -Cne -cn and on which we have 1 2 ≤ v i ν 2 ≤ 2 for all i ∈ [n] and in particular 1 2 ≤ v 0 2 , and thus by Lemma E.31 Y i = 1 v 0 2 v i ν 2 for all i ∈ [n] . By Taylor expansion with Lagrange remainder of the smooth function x → x -1 on the domain x > 0 about the point 1, we have 1 x = 1 -(x -1) + 1 ξ 3 (x -1) 2 , where ξ lies between 1 and x. If (g 1 , g 2 ) ∈ E, then for all i v 0 3 2 v i ν 3 2 ≥ (1/64), and we can therefore assert v 0 2 v i ν 2 -1 -64 v 0 2 v i ν 2 -1 2 ≤ 1 -Y i ≤ v 0 2 v i ν 2 -1 . 
(E.57) By Gauss-Lipschitz concentration, we have P[| v 0 2 -E[ v 0 2 ]| ≥ t] ≤ 2e -cnt 2 and P[| v i ν 2 -E[ v i ν 2 ]| ≥ t] ≤ 2e -cnt 2 . Lemma E.19 implies that 1 -2/(n -1) ≤ E[ v i ν 2 ] ≤ 1 and 1 -2/n ≤ E[ v 0 2 ] ≤ 1, so we can conclude when n ≥ d and when n is larger than a constant that | v 0 2 -1| ≤ C d n ; ∀i ∈ [n], v i ν 2 -1 ≤ C d n with probability at least 1 -C ne -d , by a union bound. Using then the fact that v i ν 2 ≤ 2 for all i on the event E together with the previous estimates and (E.57), we obtain with probability at least 1 -C ne -cn -C ne -d (via a union bound with the measure of E) that for all i, -C d n -C d n ≤ 1 -Y i ≤ C d n . As long as n ≥ d, we conclude that with the same probability, for all i we have |Y i -1| ≤ C d/n. We can therefore apply Lemma E.34 to get that with probability at least 1 -C ne -cn -C ne -d we have |Ξ 1 (ν, g 1 , g 2 ) -E[Ξ 1 (ν, g 1 , g 2 )]| ≤ 2 n i=1 σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν -E σ(g 1i ) 3 ρ(-g 1i cot ν) sin 3 ν + C d n + (C ) 1/4 ne -cn/4 + (C ) 1/4 ne -d/4 , (E.58) where we also used the triangle inequality for the 4 norm to simplify the fourth root term, together with n ≥ 1. For ν ∈ [0, π], we define f ν : R → R by f ν (g) = σ(g) 3 √ 2πk 2 sin 3 ν exp - 1 2k 2 g 2 cot 2 ν , so that the task that remains is to control | i f ν (g 1i ) -E[f ν (g 1i )]|. We start by applying Lemma E.36 to obtain an estimate f ν (g) ≤ C n|cos ν| 3 , where C > 0 is an absolute constant. When 0 ≤ ν ≤ π/4 or 3π/4 ≤ ν ≤ π, we have therefore f ν (g) ≤ C/n. Meanwhile, if π/4 ≤ ν ≤ 3π/4, we have f ν (g) ≤ C √ nσ(g) 3 , so we can conclude f ν (g) ≤ C/n + C √ nσ(g) 3 for all ν, which shows that f ν (g) is not much larger than C √ nσ(g) 3 . Next, let ḡ ∼ N (0, 1), so that g d = kḡ; we have for any t ≥ 0 P C √ nσ(g) 3 ≥ t = P σ(ḡ) ≥ C (nt) 1/3 ≤ exp -1 2 (C ) 2 (nt) 2/3 , where we use the classical estimate P[ḡ ≥ t] ≤ e -t 2 /2 , valid for t ≥ 1, and accordingly require t ≥ (C ) -3 n -1 .
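The Taylor bound (E.57) rests on the elementary inequality (x - 1) - 64(x - 1)^2 ≤ 1 - 1/x ≤ (x - 1) for x ≥ 1/4, obtained from the Lagrange remainder. As a quick numerical sanity check (illustrative only, not part of the proof; the grid is arbitrary):

```python
# Check (not part of the proof): for x >= 1/4, the Lagrange form
# 1/x = 1 - (x - 1) + (x - 1)^2 / xi^3, with xi between 1 and x, together
# with xi >= 1/4 (so 1/xi^3 <= 64), gives
#     (x - 1) - 64*(x - 1)**2  <=  1 - 1/x  <=  (x - 1).
xs = [0.25 + 0.001 * k for k in range(3751)]  # grid over [0.25, 4.0]
ok_lower = all((x - 1) - 64 * (x - 1) ** 2 <= 1 - 1 / x for x in xs)
ok_upper = all(1 - 1 / x <= (x - 1) for x in xs)
```

Both inequalities are exact algebra: the upper bound is equivalent to (x - 1)^2 / x ≥ 0, and the lower bound to (x - 1)^2 (64 - 1/x) ≥ 0 for x ≥ 1/64.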
In particular, there is an absolute constant C > 0 such that we have P C √ nσ(g) 3 ≥ C √ nd = P σ(ḡ) ≥ n d 1/6 ≤ exp - 1 2 n d 1/3 ≤ e -d , where the last inequality holds in particular when n ≥ 8d 4 (and this condition implies what is necessary for the second to last to hold when d ≥ 1). Returning to our bound on f ν , we note that when n ≥ (C/C ) 2 d, we have that f ν (g) - 2C √ nd ≤ C n + C √ nσ(g) 3 - 2C √ nd ≤ C √ nσ(g) 3 - C √ nd , from which we conclude that when our previous hypotheses on n are in force P f ν (g) ≥ 2C √ nd ≤ e -d . (E.59) We are going to use this result to control | i f ν (g 1i ) -E[f ν (g 1i )]| using a truncation approach. Define M = 2C / √ nd, where C > 0 is the absolute constant in (E.59). We write using the triangle inequality n i=1 f ν (g 1i ) -E[f ν (g 1i )] ≤ n i=1 f ν (g 1i ) -f ν (g 1i )1 fν (g1i)≤M + n i=1 f ν (g 1i )1 fν (g1i)≤M -E f ν (g 1i )1 fν (g1i)≤M + n i=1 E f ν (g 1i )1 fν (g1i)≤M -E[f ν (g 1i )] . By (E.59) and a union bound, we have with probability at least 1 -ne -d n i=1 f ν (g 1i ) -f ν (g 1i )1 fν (g1i)≤M = 0. Moreover, we calculate n i=1 E f ν (g 1i )1 fν (g1i)≤M -E[f ν (g 1i )] ≤ n i=1 E f ν (g 1i )1 fν (g1i)>M ≤ n i=1 P[f ν (g 1i ) > M ] 1/2 f ν (g 1i ) L 2 ≤ Ce -d/2 for an absolute constant C > 0, using in the second line the Schwarz inequality, and in the third line (E.56) and (E.59). The second term can be controlled with Lemma G.3, together with the observation that n i=1 E 1 fν (g1i)≤M f ν (g 1i ) -E 1 fν (g1i)≤M f ν (g 1i ) 2 = n i=1 E 1 fν (g1i)≤M f ν (g 1i ) 2 -E 1 fν (g1i)≤M f ν (g 1i ) 2 ≤ n i=1 E f ν (g 1i ) 2 ≤ C/n, where the last inequality is due to (E.56). Lemma G.3 thus gives for any t ≥ 0 P n i=1 f ν (g 1i )1 fν (g1i)≤M -E f ν (g 1i )1 fν (g1i)≤M ≥ t ≤ 2 exp - t 2 /2 Cn -1 + M t/3 . 
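Two small checks of the tail estimates in this step (illustrative only, not part of the proof): the Gaussian tail bound P[ḡ ≥ t] ≤ e^(-t^2/2) in fact holds on the whole range tested below, and n ≥ 8d^4 indeed forces (1/2)(n/d)^(1/3) ≥ d, which is what makes the final bound e^(-d):

```python
import math

# Check (not part of the proof): the Gaussian tail P[g >= t] <= exp(-t^2/2)
# for g ~ N(0, 1) (used above with t = (n/d)^(1/6)), and the arithmetic that
# n >= 8*d^4 gives (1/2)*(n/d)^(1/3) >= d, hence exp(-(1/2)*(n/d)^(1/3)) <= exp(-d).
def upper_tail(t):
    # P[g >= t] via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2.0))

tail_ok = all(
    upper_tail(0.1 * k) <= math.exp(-((0.1 * k) ** 2) / 2) for k in range(61)
)
exponent_ok = all(
    0.5 * (n / d) ** (1.0 / 3.0) >= d * (1.0 - 1e-12)  # tolerance for float pow
    for d in range(1, 20)
    for n in (8 * d ** 4, 10 * d ** 4)
)
```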
It follows that there is an absolute constant C > 0 such that P n i=1 f ν (g 1i )1 fν (g1i)≤M -E f ν (g 1i )1 fν (g1i)≤M ≥ C d n ≤ 2e -d , and therefore with probability at least 1 -2ne -d (by a union bound) we have n i=1 f ν (g 1i ) -E[f ν (g 1i )] ≤ C d n + C e -d/2 . Combining with (E.58) using a union bound and worst-casing constants in the exponent, we conclude that with probability at least 1 -C ne -c d -C ne -c n , we have |Ξ 1 (ν, g 1 , g 2 ) -E[Ξ 1 (ν, g 1 , g 2 )]| ≤ C d n + C e -cd + C ne -c n . Aggregating our hypotheses on n, there are absolute constants C 1 , C 2 > 0 such that if n ≥ C 1 d 4 log 4 n and d ≥ C 2 , then, replacing d with a constant multiple of d log n in the previous bound, with probability at least 1 -C n -c d/2 -C ne -c n , we have |Ξ 1 (ν, g 1 , g 2 ) -E[Ξ 1 (ν, g 1 , g 2 )]| ≤ C d log n n + C n -c d + C ne -c n , which is the desired type of bound. Lemma E.40. In the notation of Lemma E.13, there are absolute constants c, c , C > 0 such that for any δ ≥ 3/2, we have P E g2 [Ξ 1 (ν, g 1 , g 2 )] -E g1,g2 [Ξ 1 (ν, g 1 , g 2 )] is C + C n 1+δ -Lipschitz ≥ 1 -2e -cn -C n -δ . Proof. Write f (ν, g 1 ) = Eg 2 [Ξ 1 (ν, g 1 , g 2 )] ; it will suffice to differentiate f and E[f ] with respect to ν, bound the derivatives on an event of high probability, and apply the triangle inequality to obtain a high-probability Lipschitz estimate for |E g2 [Ξ 1 (ν, g 1 , g 2 )] -Eg 1 ,g2 [Ξ 1 (ν, g 1 , g 2 )]|. Define k = 2/n. For fixed (g 1 , g 2 ), the function q(ν, g 1 , g 2 ) = n i=1 σ(g 1i ) 3 ρ(-g 1i cot ν) ψ( v 0 2 )ψ( v i ν 2 ) sin 3 ν is differentiable at all but at most n points of (0, π), using Lemma E.31 to see that the only obstruction to differentiability is the function σ in the term v i ν 2 ; elsewhere it has derivative q (ν, g 1 , g 2 ) = n i=1 σ(g 1i ) 3 √ 2πk 2 ψ( v 0 2 ) g 2 1i cos ν k 2 ψ( v i ν 2) sin 6 ν - ψ ( v i ν 2) v i ν , vi ν ψ( v i ν 2) 2 v i ν 2 sin 3 ν -3 cos ν ψ( v i ν 2) sin 4 ν exp - 1 2k 2 g 2 1i cot 2 ν .
The triangle inequality and Lemma E.31 yield |q (ν, g 1 , g 2 )| ≤ 4 √ 2πk 2 n i=1 |g 1i | 3 4g 2 1i k 2 sin 6 ν + 16C vi ν 2 sin 3 ν + 12 sin 4 ν exp - 1 2k 2 g 2 1i cot 2 ν (E.60) for C > 0 an absolute constant. We have vi ν 2 ≤ g 1 2 + g 2 2 by the triangle inequality, so to obtain a (ν, g 2 )-integrable upper bound it suffices to remove the ν dependence from the previous estimate. We argue as follows: if 0 ≤ ν ≤ π/4 or 3π/4 ≤ ν ≤ π, we have cos 2 ν ≥ 1 2 , and so for any p ≥ 3 exp -1 2k 2 g 2 1i cot 2 ν sin p ν ≤ exp g 2 1i 4k 2 1 sin 2 t sin -p ν. (E.61) By Lemma E.36, where we put C = g 2 1i /4k 2 and therefore have to require that g 1i = 0 for all i ∈ [n] (a set of measure zero in R n ), this yields |q (ν, g 1 , g 2 )| ≤ C g 2 2 , (E.62) where C > 0 is a constant depending only on n and g 1 . In cases where g 1i = 0 for some i, we note that the bound (E.60) is then equal to zero, which also satisfies the estimate (E.62). On the other hand, when π/4 ≤ ν ≤ 3π/4, then sin t ≥ 2 -1/2 , and we can assert for any p ≥ 3 exp -1 2k 2 g 2 1i cot 2 ν sin p ν ≤ 2 p/2 . By the triangle inequality, this too implies |q (ν, g 1 , g 2 )| ≤ C g 2 2 , where C > 0 is a constant depending only on n and g 1 . Invoking then Lemma G.9, we conclude that q is absolutely integrable over [0, π]×R n , so that an application of Fubini's theorem and (Cohn, 2013, Theorem 6.3.11) gives the Taylor expansion f (ν, g 1 ) = f (0, g 1 ) + ν 0 Eg 2 [q (t, g 1 , g 2 )] dt. Next, we show also that q is absolutely integrable over [0, π] × R n × R n , which implies that E[f (ν, g 1 )] = E[f (0, g 1 )] + ν 0 E[q (t, g 1 , g 2 )] dt as well. Starting from (E.60), we have E[|q (ν, g 1 , g 2 )|] ≤ E 4 √ 2πk 2 n i=1 |g 1i | 3 4g 2 1i k 2 sin 6 ν + 16C( g i 1 2 + g i 2 2 ) sin 3 ν + 12 sin 4 ν exp - 1 2k 2 g 2 1i cot 2 ν , and the expectation factors over g 1i , g i 1 , g i 2 , so we can separately compute the g 1i integrals first. 
For the first of the three terms on the RHS of the previous expression, we have E g1i |g 1i | 5 sin 6 ν exp - 1 2k 2 g 2 1i cot 2 ν = 1 √ 2πk 2 R |g| 5 sin 6 ν exp - 1 2k 2 g 2 / sin 2 ν dg = 1 √ 2πk 2 R |g| 5 exp - 1 2k 2 g 2 dg, (E.63) after the change of variables g → g sin ν in the integral, which is valid whenever 0 < ν < π. To take care of the case where ν = 0 or ν = π, we can use the estimate (E.61), valid for ν sufficiently close to 0 or π, and the assumption g 1i = 0 for all i to conclude that lim ν 0 q (ν, g 1 , g 2 ) = 0 for any such fixed (g 1 , g 2 ), and by symmetry the analogous result lim ν π q (ν, g 1 , g 2 ) = 0; and whenever for some i we have g 1i = 0, we use (E.60) to see that the term in the sum involving g 1i poses no problems as ν 0 or ν π because it is identically 0. Returning to the integral (E.63), we have after a change of variables 1 √ 2πk 2 R |g| 5 exp - 1 2k 2 g 2 dg = k 5 √ 2π R |g| 5 exp - 1 2 g 2 dg = Ck 5 , where C > 0 is an absolute constant, and where we use Lemma G.11 for the last equality. The remaining two terms can be treated using the same argument: we get E g1i |g 1i | 3 sin 3 ν exp - 1 2k 2 g 2 1i cot 2 ν = C k 3 (after using |sin ν| ≤ 1) and E g1i |g 1i | 3 sin 4 ν exp - 1 2k 2 g 2 1i cot 2 ν = C k 3 for absolute constants C , C > 0. Combining these estimates gives E[|q (ν, g 1 , g 2 )|] ≤ C n n i=1 E g i 1 ,g i 2 g i 1 2 + g i 2 2 , and using Lemma G.9 (or equivalently Jensen's inequality) gives finally E[|q (ν, g 1 , g 2 )|] ≤ C n -1 n ≤ C. To conclude, we need to show that Eg 2 [q (ν, g 1 , g 2 )] is uniformly bounded by a polynomial in n with high probability.
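The absolute Gaussian moments invoked here via Lemma G.11 have closed forms; for instance E|ḡ|^5 = 8 √(2/π) for standard ḡ, which after the scaling g = kḡ gives the Ck^5 evaluation above. A quadrature sanity check (illustrative only, not part of the proof):

```python
import math

# Midpoint-rule check (not part of the proof) that E|Z|^5 = 8*sqrt(2/pi) for
# Z ~ N(0, 1); by the change of variables g -> k*g this yields
# (2*pi*k^2)^(-1/2) * int |g|^5 * exp(-g^2 / (2*k^2)) dg = C * k^5 with C = 8*sqrt(2/pi).
def gaussian_abs_moment5(num=200_000, lim=12.0):
    # the tail beyond |g| = 12 is negligible at double precision
    dg = 2 * lim / num
    total = 0.0
    for i in range(num):
        g = -lim + (i + 0.5) * dg
        total += abs(g) ** 5 * math.exp(-0.5 * g * g) * dg
    return total / math.sqrt(2 * math.pi)

val = gaussian_abs_moment5()
ref = 8 * math.sqrt(2 / math.pi)
```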
For this we start from the estimate (E.60) and apply the argument following that, but with more care in tracking the constants: if ν is within π/4 of either 0 or π, we can assert |q (ν, g 1 , g 2 )| ≤ C k n i=1 C 1 k 4 |g 1i | + C 2 k 3 ( g i 1 2 + g i 2 2 ) + C 3 k 4 |g 1i | whenever g 1i = 0 for every i (a set of full measure); and when ν is within π/4 of π/2, we can assert |q (ν, g 1 , g 2 )| ≤ C k n i=1 C 1 |g 1i | 5 k 2 + C 2 |g 1i | 3 ( g i 1 2 + g i 2 2 ) + C 3 |g 1i | 3 , where C i , C i > 0 are absolute constants. By the triangle inequality, independence, and Lemma G.9, when we consider |E g2 [q (ν, g 1 , g 2 )]|, the term E[ g i 2 2 ] is bounded by an absolute constant. Additionally, by Gauss-Lipschitz concentration and Lemma G.9, we have that simultaneously for all i g i 1 2 ≤ g 1 2 ≤ 2 with probability at least 1 -2e -cn . Moreover, since g 1 ∞ ≤ g 1 2 we also have control of the magnitude of each |g 1i | on this event, so with probability at least 1 -2e -cn we have for every ν E g2 [q (ν, g 1 , g 2 )] ≤ C k n i=1 C 1 k 4 |g 1i | + C 2 k 3 + C 3 k 2 + C 4 for absolute constants C, C i > 0. If X ∼ N (0, 1), we have for any t ≥ 0 that P[|X| ≥ t] ≥ 1 -Ct, where C > 0 is an absolute constant; so if X i ∼ i.i.d. N (0, 1), we have by independence and if t is less than an absolute constant P [∀i, |X i | ≥ t] ≥ (1 -Ct) n ≥ 1 -C nt, where the last inequality uses the numerical inequality e -2t ≤ 1 -t ≤ e -t , valid for 0 ≤ t ≤ 1 2 . From this expression, we conclude that when 0 ≤ t ≤ cn -1/2 for an absolute constant c > 0, we have P[∀i ∈ [n], |X i | ≥ t] ≥ 1 -Cn 3/2 t, so choosing in particular t = cn -(δ+3/2) for any δ > 0, we conclude that P [∀i ∈ [n], |g 1i | ≥ cn -3/2-δ ] ≥ 1 -C n -δ .
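The two elementary facts used in this anti-concentration step, P[|X| ≥ t] ≥ 1 - Ct with C = √(2/π) (from the density bound at the origin) and e^(-2t) ≤ 1 - t ≤ e^(-t) on [0, 1/2], can be sanity-checked numerically (illustrative only, not part of the proof):

```python
import math

# Check (not part of the proof): P[|X| < t] = erf(t / sqrt(2)) <= sqrt(2/pi) * t
# for X ~ N(0, 1), i.e. P[|X| >= t] >= 1 - C*t with C = sqrt(2/pi); and the
# numerical inequality e^{-2t} <= 1 - t <= e^{-t} for 0 <= t <= 1/2.
C = math.sqrt(2 / math.pi)
ts = [0.01 * k for k in range(1, 51)]  # t in (0, 0.5]
ok_density = all(math.erf(t / math.sqrt(2)) <= C * t for t in ts)
ok_numeric = all(math.exp(-2 * t) <= 1 - t <= math.exp(-t) for t in ts)
```

The density bound is just erf(x) ≤ 2x/√π applied at x = t/√2, i.e. the Gaussian density is maximized at 0.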
Consequently, for any δ > 0 we have with probability at least 1 - C n -δ -2e -cn E g2 [q (ν, g 1 , g 2 )] ≤ C k n i=1 C 1 k 4 n 3/2+δ + C 2 k 3 + C 3 k 2 + C 4 , and since k = 2/n, this yields |E g2 [q (ν, g 1 , g 2 )]| ≤ C 1 n 1+δ + C 2 + C 3 n 5/2 + C 4 n 3/2 with the same probability. Consequently we can conclude that for any δ ≥ 3/2, we have P E g2 [Ξ 1 (ν, g 1 , g 2 )] -E g1,g2 [Ξ 1 (ν, g 1 , g 2 )] is C + C n 1+δ -Lipschitz ≥ 1 -2e -cn -C n -δ . Lemma E.41. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, c , C, C > 0 and an absolute constant K > 0 such that if n ≥ K, there is an event with probability at least 1 -Ce -cn on which ∀ν ∈ [0, π], |Ξ 2 (ν, g 1 , g 2 ) -E[Ξ 2 (ν, g 1 , g 2 )]| ≤ C e -c n . Proof. Let E denote the event E 0.5,0 in Lemma E.16; then by that lemma, E has probability at least 1 -Ce -cn as long as n ≥ C , where c, C, C > 0 are absolute constants, and for (g 1 , g 2 ) ∈ E, one has for all ν ∈ [0, π] Ξ 2 (ν, g 1 , g 2 ) = 0. This allows us to calculate, for each ν, E[Ξ 2 (ν, g 1 , g 2 )] = E[1 E c Ξ 2 (ν, g 1 , g 2 )] ≤ E[1 E c ] 1/2 Ξ 2 (ν, • ) L 2 ≤ C e -c n , after applying Lemma E.37 and Lyapunov's inequality and worst-casing constants. We conclude that with probability at least 1 -Ce -cn ∀ν ∈ [0, π], |Ξ 2 (ν, g 1 , g 2 ) -E[Ξ 2 (ν, g 1 , g 2 )]| ≤ C e -c n . Lemma E.42. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, c , C, C > 0 and an absolute constant K > 0 such that if n ≥ K, there is an event with probability at least 1 -Ce -cn on which ∀ν ∈ [0, π], |Ξ 3 (ν, g 1 , g 2 ) -E[Ξ 3 (ν, g 1 , g 2 )]| ≤ C e -c n . Proof. The argument is identical to that of Lemma E.41. Let E denote the event E 0.5,0 in Lemma E.16; then by that lemma, E has probability at least 1 -Ce -cn as long as n ≥ C , where c, C, C > 0 are absolute constants, and for (g 1 , g 2 ) ∈ E, one has for all ν ∈ [0, π] Ξ 3 (ν, g 1 , g 2 ) = 0.
This allows us to calculate, for each ν, E[Ξ 3 (ν, g 1 , g 2 )] = E[1 E c Ξ 3 (ν, g 1 , g 2 )] ≤ E[1 E c ] 1/2 Ξ 3 (ν, • ) L 2 ≤ C e -c n , after applying Lemma E.37 and Lyapunov's inequality and worst-casing constants. We conclude that with probability at least 1 -Ce -cn ∀ν ∈ [0, π], |Ξ 3 (ν, g 1 , g 2 ) -E[Ξ 3 (ν, g 1 , g 2 )]| ≤ C e -c n . Lemma E.43. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, C, C , C > 0 and absolute constants K, K > 0 such that if n ≥ Kd log n and d ≥ K , there is an event with probability at least 1 -Ce -cn -C n -d on which one has ∀ν ∈ [0, π], |Ξ 4 (ν, g 1 , g 2 ) -E[Ξ 4 (ν, g 1 , g 2 )]| ≤ C d log n n . Proof. We are going to control the expectation first, showing that it is small; then prove that |Ξ 4 | is small uniformly in ν. Let E denote the event E 0.5,0 in Lemma E.16; then by that lemma, E has probability at least 1 -Ce -cn as long as n ≥ C , where c, C, C > 0 are absolute constants, and for (g 1 , g 2 ) ∈ E, one has for all ν ∈ [0, π] Ξ 4 (ν, g 1 , g 2 ) = -2 v 0 , vν v ν , vν v 0 2 v ν 3 2 . Thus, if we write Ξ 4 (ν, g 1 , g 2 ) = -21 E (g 1 , g 2 ) v 0 , vν v ν , vν v 0 2 v ν 3 2 , we have Ξ 4 = Ξ 4 for all ν whenever (g 1 , g 2 ) ∈ E, so that for any ν |E[Ξ 4 (ν, g 1 , g 2 )]| = E[ Ξ 4 (ν, g 1 , g 2 )] + E[1 E c Ξ 4 (ν, g 1 , g 2 )] ≤ |E[ Ξ 4 (ν, g 1 , g 2 )]| + Ce -cn , where the second line uses the triangle inequality and the Schwarz inequality and Lemma E.37 together with the Lyapunov inequality. We proceed with analyzing the expectation of Ξ 4 . Using the Schwarz inequality gives  E Ξ 4 (ν, g 1 , g 2 ) ≤ 2E v 0 , vν 2 v ν , vν 2 1/2 E 1 E v 0 2 2 v ν 6 2 1/2 ≤ 32E v 0 , vν 2 v ν , vν )] = 0, which implies E v 0 , vν 2 v ν , vν 2 ≤ C/n, from which we conclude |E[Ξ 4 (ν, g 1 , g 2 )]| ≤ C/ √ n. Next, we control the deviations of Ξ 4 with high probability. By Lemma E.17, there is an event E a with probability at least 1 -e -cn on which vν 2 ≤ 4 for every ν ∈ [0, π]. 
Therefore on the event E b = E ∩ E a , which has probability at least 1 -Ce -cn by a union bound, we have using Cauchy-Schwarz that for every ν |Ξ 4 (ν, g 1 , g 2 )| ≤ 256| v ν , vν |. The coordinates of the random vector v ν vν are σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν), and we note E[σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν)] = -E[σ(g 1i )g 2i ] = 0, by rotational invariance. Moreover, the calculation (E.35) together with Lemmas G.11 and E.17 demonstrates subexponential moment growth with rate C/n, so Lemma G.2 implies for t ≥ 0 P[ v ν , vν ≥ t] ≤ 2e -cnt min{c t,1} . For large enough n, this gives P v ν , vν ≥ C d log n n ≤ 2n -2d . We turn to the uniformization of this pointwise bound. The map ν → i σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν) is continuous, and differentiable at all but finitely many points of [0, π] (following the zero crossings argument in the proof of Lemma E.22) with derivative ν → i σ(g 1i cos ν + g 2i sin ν)(g 2i cos ν -g 1i sin ν) 2 -σ(g 1i cos ν + g 2i sin ν) 2 , which is evidently integrable using the triangle inequality and Lemma G.11. In particular, we can write the derivative as vν 2 2 -v 0 2 2 . Thus, by (Cohn, 2013, Theorem 6.3.11), to get a Lipschitz estimate on ν → v ν , vν it suffices to bound the magnitude of the derivative ν → vν 2 2 -v 0 2 2 on this event, and to combine the resulting Lipschitz estimate with the pointwise bound over a suitable net of [0, π]. As long as d ≥ 1 2 , we have that this probability is at least 1 -Ce -cn -C n -d , and so the triangle inequality and a union bound yield finally that with probability at least 1 -Ce -cn -C n -d ∀ν ∈ [0, π], |Ξ 4 (ν, g 1 , g 2 ) -E[Ξ 4 (ν, g 1 , g 2 )]| ≤ C d log n n . Lemma E.44. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, c , c , C, C , C , C , C > 0 and an absolute constant K > 0 such that if n ≥ Kd log n, there is an event with probability at least 1 -Ce -cn -C e -d on which one has |Ξ 5 (ν, g 1 , g 2 ) -E[Ξ 5 (ν, g 1 , g 2 )]| ≤ C d n + C e -c d + C e -c n . Proof. Fix ν ∈ [0, π].
Let E denote the event E 0.5,0 in Lemma E.16; then by that lemma, E has probability at least 1 -Ce -cn as long as n ≥ C , where c, C, C > 0 are absolute constants, and for (g 1 , g 2 ) ∈ E, one has for all ν ∈ [0, π] Ξ 5 (ν, g 1 , g 2 ) = - v 0 , v ν vν 2 2 v 0 2 v ν 3 2 . Thus, if we write Ξ 5 (ν, g 1 , g 2 ) = -1 E (g 1 , g 2 ) v 0 , v ν vν 2 2 v 0 2 v ν 3 2 we have Ξ 5 = Ξ 5 for any ν whenever (g 1 , g 2 ) ∈ E, so that by the triangle inequality, for any ν |Ξ 5 (ν, g 1 , g 2 ) -E[Ξ 5 (ν, g 1 , g 2 )]| ≤ Ξ 5 (ν, g 1 , g 2 ) -E Ξ 5 (ν, g 1 , g 2 ) + E[Ξ 5 (ν, g 1 , g 2 )] -E Ξ 5 (ν, g 1 , g 2 ) ≤ Ξ 5 (ν, g 1 , g 2 ) -E Ξ 5 (ν, g 1 , g 2 ) + E 1 E c Ξ 5 (ν, g 1 , g 2 ) -Ξ 5 (ν, g 1 , g 2 ) ≤ Ξ 5 (ν, g 1 , g 2 ) -E Ξ 5 (ν, g 1 , g 2 ) + Ce -cn , where the second line uses the triangle inequality, and the third line uses the Schwarz inequality and Lemma E.37 together with the Lyapunov inequality. So, we can proceed analyzing Ξ 5 . First, we aim to apply Lemma E.33 with the choices X = -1 E v 0 , v ν v 0 2 v ν 2 ; Y = 1 E vν 2 2 v ν 2 2 , since then XY = Ξ 5 (ν, g 1 , g 2 ). For the moment bounds, we have Y -1 L 2 ≤ 1 + Y L 2 ≤ 1 + 4E vν 4 2 1/2 ≤ 1 + 4 √ 1 + C, where C > 0 is an absolute constant; the first inequality is the Minkowski inequality, the second uses the property of E and drops the indicator by nonnegativity, and the third applies Lemma E.29, and discards the n -1 factor. For deviations, we start by noting that E[ v ν 2 2 ] = 1, and that by Lemmas G.2 and G.11, we have P v ν 2 2 -1 ≥ t ≤ 2e -cnt min{Ct,1} . It follows that there exists an absolute constant C > 0 such that, putting t = C d/n and choosing n ≥ (C /C) 2 d, we have P v ν 2 2 -1 ≥ C d n ≤ 2e -d . (E.65) Moreover, by Lemma E.17, we can run a similar argument on vν 2 2 to get that if n is larger than a constant multiple of d P vν 2 2 -1 ≥ C d n ≤ 2e -d .
(E.66) Next, Taylor expansion with Lagrange remainder of the smooth function x → x -1 on the domain x > 0 about the point 1 gives 1 x = 1 -(x -1) + 1 ξ 3 (x -1) 2 , (E.67) where ξ lies between 1 and x. If (g 1 , g 2 ) ∈ E, then v ν 6 2 ≥ (1/64), and we can therefore assert 1 -v ν 2 2 -1 ≤ 1 v ν 2 2 ≤ 1 -v ν 2 2 -1 + 64 v ν 2 2 -1 2 with probability at least 1 -Ce -cn . Using a union bound together with (E.65) (and changing the constant to C), we have with probability at least 1 -2e -cd -C e -c n that -C d n -64C 2 d n ≤ 1 - 1 v ν 2 2 ≤ C d n . Given that n ≥ d, it follows that with the same probability we have -C(1 + 64C) d n ≤ 1 - 1 v ν 2 2 ≤ C d n , which implies that with probability at least 1 -2e -d -C e -cn , we have 1 - 1 v ν 2 2 ≤ C d n . Now, the triangle inequality gives vν 2 2 v ν 2 2 -1 ≤ vν 2 2 v ν 2 2 - 1 v ν 2 2 + 1 v ν 2 2 -1 ≤ 1 v ν 2 2 vν 2 2 -1 + 1 v ν 2 2 -1 . When (g 1 , g 2 ) ∈ E, we have v ν 2 2 ≥ 1 4 , so, by a union bound, with probability at least 1 -4e -d -C e -cn we have vν 2 2 v ν 2 2 -1 ≤ 4C d n , and since (g 1 , g 2 ) ∈ E =⇒ Y = vν 2 2 vν 2 2 , another union bound and the measure bound on E let us conclude that with probability at least 1 -4e -d -C e -cn , we have |Y -1| ≤ 4C d n . If we choose n ≥ (1/c)(d + log C /4), we have 4e -d + C e -cn ≤ 8e -d , so the previous bound occurs with probability at least 1 -8e -d . We can now apply Lemma E.33 to get with probability at least 1 -8e -d Ξ 5 -E Ξ 5 ≤ 1 E v 0 v 0 2 , v ν v ν 2 -E 1 E v 0 v 0 2 , v ν v ν 2 + C d n + C e -d/2 . Next, we attempt to apply Lemma E.33 again, this time to X = 1 E v 0 , v ν and Y = 1 E ( v 0 2 v ν 2 ) -1 . Using the definition of E, we have |X| ≤ 4 and Y -1 L 2 ≤ Y L 2 + 1 ≤ 5, where the second bound also leverages the Minkowski inequality; so we need only establish deviations of Y . 
Applying again (E.67), and using (g 1 , g 2 ) ∈ E implies v 0 2 v ν 2 ≥ 1 4 , we get ( v 0 2 v ν 2 -1) -64 ( v 0 2 v ν 2 -1) 2 ≤ 1 - 1 v 0 2 v ν 2 ≤ ( v 0 2 v ν 2 -1) (E.68) with probability at least 1 -Ce -cn . Using Lemma G.11 and (Vershynin, 2018, Theorem 3.1.1), we can assert for any ν ∈ [0, π] and any t ≥ 0 P[| v ν 2 -1| ≥ t] ≤ 2e -cnt 2 , which implies that there exists an absolute constant C > 0 such that for any d > 0 P | v ν 2 -1| ≥ C d n ≤ 2e -d . In particular, when n ≥ d, we can assert that v ν 2 ≤ 1 + C with probability at least 1 -2e -d . By the triangle inequality and a union bound, it follows | v 0 2 v ν 2 -1| ≤ v 0 2 | v ν 2 -1| + | v 0 2 -1| ≤ C d n with probability at least 1-6e -d . Then a union bound gives that with probability at least 1-6e -d -C e -cn , (E.68) leads to -C d n 1 + 64C d n ≤ 1 - 1 v 0 2 v ν 2 ≤ C d n , and using n ≥ d and worst-casing constants implies that with the same probability 1 - 1 v 0 2 v ν 2 ≤ C d n . Then since (g 1 , g 2 ) ∈ E =⇒ Y = ( v 0 2 v ν 2 ) -1 , another union bound gives that with probability at least 1 -6e -d -C e -cn we have |Y -1| ≤ C d/n. As in the previous step of the reduction, we can choose n ≥ (1/c)(d + log C /6) to get that 6e -d + C e -cn ≤ 12e -d , so that the previous bound occurs with probability at least 1 -12e -d . We can thus apply Lemma E.33, a union bound, and our previous work to get that with probability at least 1 -20e -d Ξ 5 -E Ξ 5 ≤ |1 E v 0 , v ν -E[1 E v 0 , v ν ]| + C d n + C e -d/2 . Whenever (g 1 , g 2 ) ∈ E, we have by the triangle inequality, the Schwarz inequality, and Lemmas E.16 and E.29 that |1 E v 0 , v ν -E[1 E v 0 , v ν ]| ≤ | v 0 , v ν -E[ v 0 , v ν ]| + |E[ v 0 , v ν ] -E[1 E v 0 , v ν ]| ≤ | v 0 , v ν -E[ v 0 , v ν ]| + Ce -cn , allowing us to drop the indicator. 
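The norm concentration invoked here, |‖v ν ‖ 2 - 1| ≤ C √(d/n) with probability at least 1 - 2e^(-d), can be illustrated empirically; the sketch below (with arbitrary n, d, and constant, not tied to the constants in the proof) samples vectors with i.i.d. N(0, 1/n) coordinates, for which Gauss-Lipschitz concentration predicts deviations of the norm at scale O(1/√n):

```python
import math
import random

# Monte Carlo illustration (not part of the proof; n, d, C, trials are
# arbitrary): for v with n i.i.d. N(0, 1/n) entries, ||v||_2 concentrates
# around 1 at scale O(1/sqrt(n)), so | ||v||_2 - 1 | <= C*sqrt(d/n) holds
# in nearly every trial.
random.seed(0)
n, d, C, trials = 400, 4, 3.0, 200
hits = sum(
    abs(
        math.sqrt(sum(random.gauss(0.0, 1.0 / math.sqrt(n)) ** 2 for _ in range(n)))
        - 1.0
    )
    <= C * math.sqrt(d / n)
    for _ in range(trials)
)
frac = hits / trials
```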
We have v 0 , v ν = i σ(g 1i )σ(g 1i cos ν + g 2i sin ν), which is a sum of independent random variables; following the argument at and around (E.36), we conclude moreover that these random variables are subexponential with rate C/n, where C > 0 is an absolute constant. We therefore obtain from Lemma G.2 the tail bound P[| v 0 , v ν -E[ v 0 , v ν ]| ≥ t] ≤ 2e -cnt min{Ct,1} , which, for a suitable choice of absolute constant C > 0 and choosing n ≥ C d, yields the deviation bound P | v 0 , v ν -E[ v 0 , v ν ]| ≥ C d n ≤ 2e -d . Taking a final union bound (since we assumed throughout that (g 1 , g 2 ) ∈ E) gives that with probability at least 1 -Ce -cn -C e -d , one has |Ξ 5 (ν, g 1 , g 2 ) -E[Ξ 5 (ν, g 1 , g 2 )]| ≤ C d n + C e -c d + C e -c n , which is sufficient to conclude pointwise concentration as claimed for sufficiently large n after we put d = d log n and include extra log n factors in any points where we need to choose n larger than d. Lemma E.45. In the notation of Lemma E.13, there are absolute constants c, C, C , C , C > 0 such that for any δ ≥ 1 2 , one has P E g2 [Ξ 5 (ν, g 1 , g 2 )] -E g1,g2 [Ξ 5 (ν, g 1 , g 2 )] is C + C n 1+δ -Lipschitz ≥ 1 -C e -cn -C n -δ . Proof. We will differentiate with respect to ν the function f (ν, g 1 ) = -E g2 v 0 , v ν vν 2 2 ψ ( v ν 2 ) ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 , and construct an event on which f has size poly(n). We also need to differentiate the function E[f ( • , g 1 )]; for this we will additionally show that f (ν, • ) is absolutely integrable over the product [0, π]×R n ×R n , which allows us to apply Fubini's theorem to move both the g 1 and g 2 expectations under the ν integral in the first-order Taylor expansion we obtain. In particular, the derivative of E[f ( • , g 1 )] will in this way be shown to be E[f ( • , g 1 )], so that linearity and the triangle inequality imply a poly(n) magnitude bound for the derivative of Eg 2 [Ξ 5 ] -E[Ξ 5 ].
Define q i (ν, g 1 , g 2 ) = v 0 , v ν ψ ( v ν 2 )(g 2i cos ν -g 1i sin ν) 2 ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 , so that, for almost all g 1 , f (ν, g 1 ) = - n i=1 E g2 [q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν)]. For each fixed (g 1 , g 2 ) and each i, the only obstructions to differentiability of q i in ν arise from the function σ (using smoothness of ψ from Lemma E.31 and the fact that it is constant whenever v ν is small enough that nondifferentiability of • 2 could pose a problem); following the zerocrossings argument of Lemma E.22, q i fails to be differentiable at no more than n points of [0, π], and otherwise has derivative q i (ν, g 1 , g 2 ) = 1 ψ( v 0 2 ) v 0 , vν ψ ( v ν 2 )(g 2i cos ν -g 1i sin ν) 2 ψ( v ν 2 ) 2 v ν 2 + v 0 , v ν v ν , vν ψ ( v ν 2 )(g 2i cos ν -g 1i sin ν) 2 ψ( v ν 2 ) 2 v ν 2 2 -2 v 0 , v ν ψ ( v ν 2 )(g 2i cos ν -g 1i sin ν)(g 1i cos ν + g 2i sin ν) ψ( v ν 2 ) 2 v ν 2 -2 ψ ( v ν 2 ) 2 v ν , vν v 0 , v ν (g 2i cos ν -g 1i sin ν) 2 ψ( v ν 2 ) 3 v ν 2 2 - ψ ( v ν 2 ) v ν , vν v 0 , v ν (g 2i cos ν -g 1i sin ν) 2 ψ( v ν 2 ) 2 v ν 3 2 . (E.69) by the chain rule and the product rule. To conclude absolute continuity of q i ( • , g 1 , g 2 ), we need to show that q i is integrable; this follows from Cauchy-Schwarz, the integrability of v 0 2 , v ν 2 , vν 2 (Lemma E.17), the triangle inequality, and the Lemma E.31 estimates ψ ≥ 1 4 , |ψ | ≤ C, |ψ | ≤ C , and |ψ (x)/x| ≤ C for any x ∈ R (to see this last estimate, note that |ψ | is bounded on R, and use that ψ is constant whenever x ≤ 1 4 ). Then (Cohn, 2013, Theorem 6.3.11) implies that q i ( • , g 1 , g 2 ) is absolutely continuous with a.e. derivative q i . Next, we can write f (ν, g 1 ) = - n i=1 E g2j :j =i E g2i [q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν)] , using Lemma E.37 to see that Fubini's theorem can be applied. Our aim is now to apply Lemma E.27, so we need to check its remaining hypotheses. 
First, continuity of q i (ν, • ) follows from continuity of σ, smoothness of ψ, and the fact that the denominator never vanishes. Joint absolute integrability of q i and q i follows from our verification of absolute integrability of q i above, which produces a final upper bound that does not depend on ν (which is therefore integrable over [0, π] as well); the corresponding result for q i follows from Lemma E.37. Last, we need the growth estimate. We have from Lemma E.31 |q i (ν, g 1 , g 2 )| ≤ 32C(g 2i cos ν -g 1i sin ν) 2 ≤ 32C(|g 2i | + |g 1i |) 2 ≤ 32C|g 1i |(1 + |g 2i |) 2 , which is evidently quadratic in |g 2i | once |g 2i | ≥ 1. Consequently we can apply Lemma E.27 to differentiate f ( • , g 1 ); we get at almost all g 1 f (ν, g 1 ) = - n i=1 E g2j :j =i E g2i [q i (0, g 1 , g 2 ) σ(g 1i )] + E g2j :j =i ν 0 dt Eg 2i [q i (t, g 1 , g 2 ) σ(g 1i cos t + g 2i sin t)] -g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t , where g i 2 is the vector g 2 but with its i-th coordinate replaced by -g 1i cot t, and where ρ is the pdf of a N (0, 2/n) random variable. The changes in g i 2 drive updates to the terms in q i as follows: we have σ(g 1i cos ν + g 2i sin ν) becoming 0, and g 2i cos ν -g 1i sin ν becoming -g 1i / sin ν. Thus, we have -g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t = - g 3 1i v i 0 , v i t ψ ( v i t 2 )ρ(-g 1i cot t) ψ( v 0 2 )ψ( v i t 2 ) 2 v i t 2 sin 4 t , where the notation v i t is in use in the Ξ 1 control section and is defined in Lemma E.26, and v i 0 is defined here similarly (the R n-1 vector which is the projection of v 0 onto all but the i-th coordinates). Using Lemma E.31, we can further assert g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ 16C |g 1i | 3 ρ(-g 1i cot t) sin 4 t (E.70) where we use that v i 0 2 ≤ v 0 2 . 
For each fixed g 1 having no coordinates equal to zero, we write K i = |g 1i | > 0; if 0 ≤ t ≤ π/4 or 3π/4 ≤ t ≤ π, we have cos 2 t ≥ 1 2 , and so ρ(-g 1i cot t) sin 4 t ≤ n 4π sin -4 t exp K 2 i n 8 1 sin 2 t . Using Lemma E.36, we have ρ(-g 1i cot t) sin 4 t ≤ n 4π 16 K 2 i n 2 . On the other hand, when π/4 ≤ t ≤ 3π/4, then sin t ≥ 2 -1/2 , and we can assert ρ(-g 1i cot t) sin 4 t ≤ 8 n/π. We conclude for any t g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ C/(K i n 3/2 ) + C √ nK 3 i (E.71) for absolute constants C, C > 0, and this upper bound is integrable jointly over t and g 2 . We have checked previously the joint integrability of the q i terms when applying Lemma E.27, so we can therefore apply Fubini's theorem to get g 1 -a.s. f (ν, g 1 ) = -E g2 n i=1 q i (0, g 1 , g 2 ) σ(g 1i ) - ν 0 n i=1 E g2 q i (t, g 1 , g 2 ) σ(g 1i cos t + g 2i sin t) -g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t dt. Consequently, to conclude a Lipschitz estimate for f ( • , g 1 ) it suffices to control the quantity under the t integral in the previous expression. We will start by controlling the second term using Markov's inequality. Following (E.70), we calculate E g1,g2 g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ 8C n π E g1   |g 1i | 3 exp -n 4 g 2 1i cos 2 t sin 2 t sin 4 t   = 4Cn π R |g| 3 sin 4 t exp - n 4 g 2 sin 2 t dg = 4Cn π R |g| 3 exp - n 4 g 2 dg, where the last line follows from the change of variables g → g sin t in the integral. We can evaluate this integral with Lemma G.11, which gives a bound E g1,g2 g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ 128C πn , and therefore a bound of C > 0 an absolute constant on the sum over i. As a byproduct of this estimate, we can assert that the second term is jointly integrable over [0, π] × R n × R n , which allows us to apply Fubini's theorem and obtain the same differentiation result for E[f ( • , g 1 )]. 
Meanwhile, beginning from (E.71), we can write using the triangle inequality E g2 n i=1 g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ n i=1 C |g 1i |n 3/2 + C √ n|g 1i | 3 . By Gauss-Lipschitz concentration and Lemma G.9, we have that g 1 2 ≤ g 1 2 ≤ 2 with probability at least 1 -2e -cn , and since g 1 ∞ ≤ g 1 2 , we conclude with the same probability that |g 1i | ≤ 2 simultaneously for all i. Meanwhile, if X ∼ N (0, 1), we have for any t ≥ 0 that P[|X| ≥ t] ≥ 1 -Ct, where C > 0 is an absolute constant; so if X i ∼ i.i.d. N (0, 1), we have by independence and if t is less than an absolute constant P [∀i, |X i | ≥ t] ≥ (1 -Ct) n ≥ 1 -C nt, where the last inequality uses the numerical inequality e -2t ≤ 1 -t ≤ e -t , valid for 0 ≤ t ≤ 1 2 . From this expression, we conclude that when 0 ≤ t ≤ cn -1/2 for an absolute constant c > 0, we have P[∀i ∈ [n], |g 1i | ≥ t] ≥ 1 -Cn 3/2 t, so choosing in particular t = cn -(δ+3/2) for any δ > 0, we conclude that P [∀i ∈ [n], |g 1i | ≥ cn -3/2-δ ] ≥ 1 -C n -δ . Then with probability at least 1 -C n -δ -2e -cn , we have E g2 n i=1 g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≤ Cn 1+δ + C n 3/2 , so as long as δ ≥ 1 2 , we have P E g2 n i=1 g 1i q i (t, g 1 , g i 2 )ρ(-g 1i cot t) sin -2 t ≥ Cn 1+δ ≤ C n -δ + 2e -cn . Proceeding now to the q i term, from the expression (E.69) we get n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) = 1 ψ( v 0 2 ) v 0 , vν ψ ( v ν 2 ) vν 2 2 ψ( v ν 2 ) 2 v ν 2 + v 0 , v ν v ν , vν ψ ( v ν 2 ) vν 2 2 ψ( v ν 2 ) 2 v ν 2 2 -2 v 0 , v ν ψ ( v ν 2 ) vν , v ν ψ( v ν 2 ) 2 v ν 2 -2 ψ ( v ν 2 ) 2 v ν , vν v 0 , v ν vν 2 2 ψ( v ν 2 ) 3 v ν 2 2 - ψ ( v ν 2 ) v ν , vν v 0 , v ν vν 2 2 ψ( v ν 2 ) 2 v ν 3 2 .
(E.72) Using the triangle inequality, Cauchy-Schwarz, and Lemma E.31, we obtain v 0 , vν ψ ( v ν 2 ) vν 2 2 ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 + v 0 , v ν v ν , vν ψ ( v ν 2 ) vν 2 2 ψ( v 0 2 )ψ( v ν 2 ) 2 v ν 2 2 ≤ C vν 3 2 , (using also the fact that ψ (x) = 0 and ψ (x) = 0 whenever x is sufficiently near to 0); and v 0 , v ν ψ ( v ν 2 ) vν , v ν ψ( v ν 2 ) 2 v ν 2 + ψ ( v ν 2 ) 2 v ν , vν v 0 , v ν vν 2 2 ψ( v ν 2 ) 3 v ν 2 2 + ψ ( v ν 2 ) v ν , vν v 0 , v ν vν 2 2 ψ( v ν 2 ) 2 v ν 3 2 ≤ C vν 2 + C vν 3 2 , from which we conclude n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) ≤ C vν 2 + C vν 3 2 for some absolute constants C, C > 0. By Lemma E.17, there is an event E of probability at least 1-Ce -cn on which we have vν 2 ≤ 4 for every ν. Moreover, we have from the triangle inequality that vν 2 ≤ g 1 2 + g 2 2 , which is independent of ν; and in particular we have n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) 2 ≤ C( g 1 2 + g 2 2 ) + C ( g 1 2 + g 2 2 ) 3 2 , which is a polynomial in g 1 2 and g 2 2 by the binomial theorem. Thus, applying independence, Lemma G.10, Lemma G.11 yields that there is an absolute constant C > 0 such that E g1,g2 C( g 1 2 + g 2 2 ) + C ( g 1 2 + g 2 2 ) 3 2 ≤ C . Therefore, as in the framework section of the proof of Lemma E.13, we can use the inequality E g2 n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) ≤ E g2 n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) , (E.73) together with the partition E g2 n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) ≤ C + E g2 1 (E ) c n i=1 q i (ν, g 1 , g 2 ) σ(g 1i cos ν + g 2i sin ν) , (E.74) and this last expression can be used to obtain a g 1 event of not much smaller probability 1 -Ce -cn on which the LHS of (E.74), and hence the LHS of (E.73), is controlled by an absolute constant uniformly in ν (in particular, using Markov's inequality as in the framework section of the proof of Lemma E.13). 
Consequently, one more application of the triangle inequality gives that P E g2 [Ξ 5 (ν, g 1 , g 2 )] -E g1,g2 [Ξ 5 (ν, g 1 , g 2 )] is C + C n 1+δ -Lipschitz ≥ 1 -C e -cn -C n -δ as long as δ ≥ 1 2 . Lemma E.46. In the notation of Lemma E.13, if d ≥ 1, there are absolute constants c, C, C , C > 0 and absolute constants K, K > 0 such that if n ≥ Kd log n and d ≥ K , there is an event with probability at least 1 -Ce -cn -C n -d on which one has ∀ν ∈ [0, π], |Ξ 6 (ν, g 1 , g 2 ) -E[Ξ 6 (ν, g 1 , g 2 )]| ≤ C d log n n . Proof. The argument is extremely similar to Lemma E.43, since both terms have small expectations and deviations essentially determinable by the same mean-zero random variable. We are going to control the expectation first, showing that it is small; then prove that |Ξ 6 | is small uniformly in ν. Let E denote the event E 0.5,0 in Lemma E.16; then by that lemma, E has probability at least 1 -Ce -cn as long as n ≥ C , where c, C, C > 0 are absolute constants, and for (g 1 , g 2 ) ∈ E, one has for all ν ∈ [0, π] Ξ 6 (ν, g 1 , g 2 ) = 3 v 0 , v ν v ν , vν 2 v 0 2 v ν 5 2 . Thus, if we write Ξ 6 (ν, g 1 , g 2 ) = 31 E (g 1 , g 2 ) v 0 , v ν v ν , vν 2 v 0 2 v ν 5 2 , we have Ξ 6 = Ξ 6 for all ν whenever (g 1 , g 2 ) ∈ E, so that for any ν |E[Ξ 6 (ν, g 1 , g 2 )]| = E[ Ξ 6 (ν, g 1 , g 2 )] + E[1 E c Ξ 6 (ν, g 1 , g 2 )] ≤ |E[ Ξ 6 (ν, g 1 , g 2 )]| + Ce -cn , where the second line uses the triangle inequality and the Schwarz inequality and Lemma E.37 together with the Lyapunov inequality. We proceed with analyzing the expectation of Ξ 6 . Using the Schwarz inequality gives Next, we control the deviations of Ξ 6 with high probability. By Lemma E.17, there is an event E a with probability at least 1 -e -cn on which vν 2 ≤ 4 for every ν ∈ [0, π]. Therefore on the using in particular sin ν ≤ ν and sin ν ≥ (2/π)ν. 
Using similarly concavity of cos on this domain, in particular the inequalities cos ν ≤ π/2 -ν and cos ν ≥ 1 -(2/π)ν, we have E Ξ 6 (ν, g 1 , g 2 ) ≤ 3E v 0 , v ν 2 v ν , vν 4 1/2 E 1 E v 0 2 2 v ν 10 2 1/2 ≤ 192E v 0 , v ν 2 v ν , vν -(C 3 -C 4 ν) cos ν ≤ -C 4 ν 2 - 2C 3 π + C 4 π 2 ν + C 3 . In total, we have a bound q (ν) ≤ -81 cos 3 ν - 2C 2 π + C 4 ν 2 + 2C 3 π + πC 4 2 + C 1 ν -C 3 . We calculate the maximizer of the concave quadratic function of ν in the previous bound via differentiation; plugging in, we get q (ν) ≤ -81 cos 3 ν + 2C3 π + πC4 2 + C 1 2 4 2C2 π + C 4 -C 3 . A numerical estimate gives 2C3 π + πC4 2 + C 1 2 4 2C2 π + C 4 -C 3 ≤ 20, and using that -cos 3 is strictly decreasing for ν < π, we can therefore guarantee q ≤ 0 as long as ν ≤ cos -1 3 20/81. Writing c = cos -1 3 20/81, we estimate numerically 0.90 ≥ c ≥ 0.89, so that this bound is nonvacuous. For ν ≥ c, we apply again concavity of cos to develop the lower bound cos ν ≥ π/2 -ν π/2 -c cos c, ν ∈ [c, π/2]. Using this to estimate the -cos 3 term in our upper bound for q , we obtain a bound 3.3, 3.4] , so that we need only consider the smaller root. Differentiating once more to determine the class of the critical point, we find for the second derivative at M/2 - 1 2 √ M 2 -4N -3D M 2 -4N < 0, so that M/2 -1 2 √ M 2 -4N is a maximizer for our cubic bound, and the bound is increasing for arguments less than this point and decreasing for arguments greater than it; we can conclude that the zero in [3.3, 3.4 ] is a minimizer, so that our bound can be ascertained negative by checking its value at M/2 -1 2 √ M 2 -4N . We find using a numerical estimate -20 π/2 -(M/2 -1 2 √ M 2 -4N ) π/2 -c 3 - 2C 2 π + C 4 (M/2 -1 2 M 2 -4N ) 2 + 2C 3 π + πC 4 2 + C 1 (M/2 -1 2 M 2 -4N ) -C 3 ≤ -1.7 < 0, which proves that q ≤ 0 on [c, π/2]. 
This shows that our lower bound on νh 1 (ν) + h 0 (ν) in (E.76) is nonincreasing on [0, π/2], so that we can assert νh 1 (ν) + h 0 (ν) ≥ 22π(π/2) -(6π 2 + 128) sin(π/2) + 27 sin 3 (π/2) + (89 -2π 2 )(π/2) + 31π cos(π/2) = 5π 2 -101. It remains to bound ν 2 h 2 (ν) = ν 2 (3π cos ν -11 sin ν). On [0, π/2], cos is decreasing and sin is increasing, so 3π cos ν -11 sin ν is decreasing here; it is positive at ν = 0 and negative at ν = π/2, so that by continuity it has a unique zero in (0, π/2). Denote this zero as ν 0 ; then using that ν 2 ≥ 0 with no zeros in the interior, we can write where the last inequality follows from a numerical estimate of the constants. inf Lemma E.48 (Uniformization). Let (Ω, F, P) be a complete probability space. For some t ∈ R, δ t ≥ 0, S ⊂ R d , and event E ∈ F, suppose that f : S × Ω → R is second-argument measurable and satisfies 1. For all x ∈ S, P[f (x, • ) ≤ t] ≥ 1 -δ t ; 2. For all g ∈ E, f ( • , g) is L-Lipschitz; 3. There is M > 0 such that sup x∈S x 2 ≤ M . Then g → sup x∈S f (x, g) is measurable, and for every ε > 0, one has P sup x∈S f (x, • ) ≤ t + Lε ≥ 1 -δ t 1 + 2M ε d -P[E]. (E.78) Proof. Because S is a subset of the separable metric space (R d , • 2 ) and all sample trajectories f ( • , g) are assumed (Lipschitz) continuous, the supremum in the definition of g → sup x∈S f (x, g) can be taken on a countable subset of S, and the resulting function of g is measurable (e.g., (Ledoux & Talagrand, 1991, §2.2 p. 45)) . By (Vershynin, 2018, Proposition 4.2.12 ) and boundedness of S, for every ε > 0 there exists an ε-net of S having cardinality at most (1 + 2M/ε) d ; denote these nets as N ε . Since each N ε is finite, we may also define for each x ∈ S a point x ε such that xx ε 2 ≤ ε; then for every g ∈ E, we have |f (x, g) -f (x ε , g)| ≤ Lε. We define a collection of events E ε by E ε = {g ∈ Ω | ∀x ∈ N ε , f (x, g) ≤ t}. (E.79) The triangle inequality then implies that if g ∈ E ε ∩ E, then for all x ∈ S, one has f (x, g) ≤ t + Lε. 
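The mechanism of Lemma E.48 — control of f at the points of an ε-net, combined with a Lipschitz estimate, controls the supremum over all of S — can be illustrated on a simple random Lipschitz function. The trigonometric family, the coefficients, and the net below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
a = rng.standard_normal(m)                   # random coefficients
L = np.sum(np.arange(1, m + 1) * np.abs(a))  # Lipschitz bound: sum_j j |a_j| >= sup |f'|

def f(x):
    # f(x) = sum_j a_j cos(j x), an L-Lipschitz function on [0, 1]
    j = np.arange(1, m + 1)
    return np.cos(np.outer(np.atleast_1d(x), j)) @ a

eps = 1e-3
net = np.arange(0.0, 1.0 + 2 * eps, 2 * eps)  # every x in [0,1] is within eps of the net
fine = np.linspace(0.0, 1.0, 200001)          # fine grid as a proxy for the supremum

sup_fine = f(fine).max()
sup_net = f(net).max()
```

By Lipschitzness, sup_fine can exceed sup_net by at most L·ε, which is exactly the Lε slack appearing in (E.78).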
Consequently, several union bounds yield It is clear from the analytical expression for φ and the mean value theorem that ϕ is strictly increasing on [0, π], since (π -ν) sin ν > 0 if 0 < ν < π. To prove strict concavity for ν ∈ (0, π), we start by simplifying notation. Consider the function ϕ r (ν) = ϕ(π -ν), which satisfies by the chain rule φr (ν) = φ(π -ν). Because ϕ r is strictly concave if and only if ϕ is strictly concave, it suffices to prove that φ(π -ν) < 0. We note φ(π -ν) < 0 ⇐⇒ (π 2 -[ν cos ν -sin ν] 2 )(-sin ν -ν cos ν) < ν 2 sin 2 ν(sin ν -ν cos ν). Multiplying both sides of the latter inequality by sin ν -ν cos ν, dividing through by (ν cos νsin ν) 2 (which is positive on (0, π), since it equals cos 2 ϕ composed with a reversal about π), and distributing and moving terms to the RHS gives the equivalent condition π 2 ν 2 cos 2 ν -sin 2 ν (ν cos ν -sin ν) 2 < ν 2 -sin 2 ν, and canceling once more gives equivalently ν cos ν + sin ν ν cos ν -sin ν < ν 2 -sin 2 ν π 2 . (E.83) Using ν cos ν -sin ν < 0, which follows from its derivative -ν sin ν being negative on (0, π), and writing g(ν) = π -2 (ν 2 -sin 2 ν), we have equivalently ν cos ν + sin ν > g(ν)(ν cos ν -sin ν), and rearranging gives the inequality (1 -g(ν))ν cos ν + g(ν) sin ν > -sin ν. (E.84) Strict concavity of sin on (0, π) gives sin ν < ν, and 0 < g(ν) < 1 follows after squaring; so the LHS is a convex combination of ν cos ν and sin ν, which in particular satisfies |(1 -g(ν))ν cos ν + g(ν) sin ν| ≤ max{|sin ν|, |ν cos ν|}. As argued before, we have sin ν -ν cos ν > 0 if ν ∈ (0, π); moreover, because ν > 0 we have ν cos ν > 0 if ν ∈ (0, π/2) and ν cos ν < 0 if ν ∈ (π/2, π). We can numerically determine sin(5π/8) + (5π/8) cos(5π/8) > 0, and given that 5π/8 ≥ 1.95 > π/2, it follows |(1 -g(ν))ν cos ν + g(ν) sin ν| < |sin ν|, 0 < ν ≤ 1.95, which implies (E.84) when 0 < ν ≤ 1.95. 
Recalling that we are arguing for ϕ r in this setting, we translate our results back to ϕ and conclude that ϕ(ν) < 0 if π -1.95 ≤ ν < π. To address the case where 0 < ν < π -1.95, we employ Lemma E.47; it allows us to conclude φ < 0 provided 0 < ν ≤ π/2, and a numerical estimate gives that π -1.95 < π/2, so that we have φ < 0 for all 0 < ν < π. Taking limits in ϕ gives concavity at the endpoints {0, π} as well. To bound φ away from zero on [0, π/2], we apply Lemma E.47 to assert φ(ν) ≤ -2 3π ν 3 + 83 24π 3 ν 4 (1 -cos 2 ϕ(ν)) 3/2 , 0 < ν ≤ π/2. The numerator in the last expression is nonpositive if 0 ≤ ν ≤ π/2, and using the lower bound in Lemma E.14 on [0, π/2], we have 1 1 -cos 2 ϕ(ν) ≥ 1 1 -max 2 {1 -1 2 ν 2 , 0} , ν > 0. From nonpositivity of the numerator, it follows φ(ν) ≤ -2 3π ν 3 + 83 24π 3 ν 4 1 -max 2 {1 -1 2 ν 2 , 0} 3/2 , 0 < ν ≤ π/2. (E.85) We have 1 -1 2 ν 2 ≥ 0 as long as 0 ≤ ν ≤ √ 2; so after removing the max, distributing, and cancelling, we have φ(ν) ≤ -2 3π + 83 24π 3 ν 1 -1 4 ν 2 3/2 , 0 < ν ≤ √ 2. The denominator of this last expression is nonnegative and has singularities at ±2, and is clearly even symmetric; so it is maximized on 0 < ν A numerical estimate shows that the RHS of the last inequality is no larger than -0.14. Since the intervals we have proved a bound over cover [0, π/2], this proves the claim with c = -0.14. The bound φ < 1 on (0, π) follows from the fact that ϕ is strictly concave on (0, π) and the mean value theorem; we have already shown φ > 0 in proving strict increasingness of ϕ. Similarly, the proof of strict concavity in the interior has already established φ < 0. To obtain the lower bound on φ, we use that φ is continuous on [0, π] and the Weierstrass theorem to assert that there is C ≥ 0 such that φ ≥ -C on [0, π]; because φ(0) = 0, we actually have C > 0. For the quadratic model, we use our previous results and Taylor expand ϕ about 0; we get immediately ϕ(ν) ≥ ν + ν 2 inf ν∈[0,π] φ(ν) 2 ≥ ν -(C/2)ν 2 . 
For the upper bound, we can assert immediately on [0, π/2] a bound ϕ(ν) ≤ ν -cν 2 , where c = 0.07 suffices. To extend the bound to ν ∈ [π/2, π], we employ a bootstrapping argument; because ϕ is concave, we have a bound ϕ(ν) ≤ ϕ(π/2) + φ(π/2)(ν -π/2) = cos -1 π -1 + π/2 √ π 2 -1 (ν -π/2) , where the second line plugs into the formulas for ϕ and φ. We will show that the graph of ν -cν 2 lies entirely above the graph of the RHS of this inequality. This condition is equivalent to -cν 2 + 1 - π/2 √ π 2 -1 ν + (π/2) 2 √ π 2 -1 -cos -1 π -1 ≥ 0; the LHS of this inequality is a concave quadratic with maximizer ν = 1/(2c) 1 -π/2 √ π 2 -1 , and numerical estimation of the constants gives ν ≥ π. Since ν is outside [π/2, π] and the quadratic is concave, we conclude that the bound is tightest at the boundary point π/2, and one checks numerically -cπ 2 /4 + 1 - π/2 √ π 2 -1 π/2 + (π/2) 2 √ π 2 -1 -cos -1 π -1 ≥ 0.15 > 0, which establishes that the bound ϕ(ν) ≤ ν -cν 2 actually holds on all of [0, π]. This completes the proof of all of the claims.

F CONTROLLING CHANGES DURING TRAINING

F.1 PRELIMINARIES

We now consider the changes in the integral operator Θ_k during gradient descent. In this section we restore the iteration subscript (dropped in other sections to lighten notation) on various quantities. Θ_k changes during training as a result of both smooth changes in the features at all layers and non-smooth changes in the backward features {β^ℓ_t(x)}, the latter due to the non-smoothness of the derivative of the ReLU function. Because it is difficult to reason precisely about the changes in Θ_t, we will bound them rather naively, by controlling Θ_t over all support patterns of the features that are reachable given a bound on the norm change of the pre-activations. We define a trajectory in parameter space that interpolates between the iterates of gradient descent, given for any integer k ∈ {0, . . . , k̄} and s ∈ [0, 1] by θ^N_{k+s} = θ^N_k − τ s ∇L^N(θ^N_k), (F.1) (with the formal derivative ∇ defined in Appendix A.1). We will henceforth use k to denote an integer indexing the iteration number and t a continuous parameter taking values in [0, k̄] (so that k = ⌊t⌋ and s = t − ⌊t⌋). Quantities indexed by t are evaluated at the parameters θ^N_t. To lighten notation, we will drop the N superscript on time-indexed quantities (aside from ζ^N_k and Θ^N_k), but all such quantities depend on the parameters as defined by (F.1). Instead of considering the change in the features α^ℓ_t(x) directly, it will be more convenient to work with the pre-activations, which are given at layer ℓ by ρ^ℓ_t(x) = W^ℓ_t P_{I_{ℓ−1,t}(x)} W^{ℓ−1}_t P_{I_{ℓ−2,t}(x)} W^{ℓ−2}_t · · · P_{I_{1,t}(x)} W^1_t x. We define a maximal allowable change in the pre-activation norm by η = C_η L^{3/2+q} / √n (F.2) for q ≥ 0 and a constant C_η to be specified later, where the scaling is chosen with foresight. We can then define a maximal number of iterations such that the pre-activation norms at all layers along the trajectory (F.1) change by no more than η.
This number k_η is the largest integer such that ||ρ^ℓ_t(x) − ρ^ℓ_0(x)||_2 ≤ η (F.3) holds for all layers ℓ, all x ∈ M, and all t ∈ [0, k_η], with η given by (F.2). Our goal will be to show that we can in fact train for long enough to reduce the fitting error without exceeding k_η iterations.
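The quantity being budgeted here — pre-activation drift along the gradient descent trajectory — can be observed in a toy setting. The sketch below tracks max_x ||ρ_t(x) − ρ_0(x)||₂ for the hidden layer of a one-hidden-layer ReLU network trained with squared loss; this is a much simpler model than the deep architecture analyzed in the paper, and the width, step size, and drift budget η are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n = 5, 200                                   # input dim and width (illustrative)
X = rng.standard_normal((20, n0))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # data on the unit sphere
y = np.sign(X[:, 0])

W1 = rng.standard_normal((n, n0)) * np.sqrt(2.0 / n0)
w2 = rng.standard_normal(n) * np.sqrt(2.0 / n)
rho0 = X @ W1.T                                  # initial pre-activations (one row per sample)

tau, eta = 1e-3, 0.5                             # step size and drift budget (illustrative)
drift, k_eta = [0.0], None
for k in range(200):
    rho = X @ W1.T
    h = np.maximum(rho, 0.0)
    err = h @ w2 - y                             # squared-loss residuals
    W1 -= tau * ((err[:, None] * w2) * (rho > 0)).T @ X / len(X)
    w2 -= tau * (h.T @ err) / len(X)
    d = np.linalg.norm(X @ W1.T - rho0, axis=1).max()
    drift.append(d)
    if k_eta is None and d > eta:
        k_eta = k + 1                            # first iterate exceeding the budget
```

The list `drift` starts at zero and grows as training proceeds; `k_eta` (if set) plays the role of the iteration budget defined by (F.3).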

F.2 CHANGES IN FEATURE SUPPORTS DURING TRAINING

Recalling the definition of the feature supports at layer ℓ, time t, and x ∈ M by I_{ℓ,t}(x) = supp(α^ℓ_t(x) > 0) ⊆ [n], we denote by I_t(x) = (I_{1,t}(x), . . . , I_{L,t}(x)) the collection of these support patterns at all layers. We would next like to relate the smooth changes in the pre-activation norms to the non-smooth changes in the supports of the features. We denote by J = (J_1, . . . , J_L) a collection of support patterns with J_ℓ ⊆ [n]. We now consider sets of support patterns that are not too different from those at initialization, as defined by B(y, η) = {supp(y + v > 0) | ||v||_2 ≤ η}, (F.4) J^η(x) = ⊗_{ℓ∈[L]} B(ρ^ℓ_0(x), η), (F.5) J^η(M) = ∪_{x∈M} J^η(x). (F.6) Note that B(ρ^ℓ_0(x), η) is simply the set of supports of the positive entries of ρ^ℓ_0(x) + v over every possible perturbation v of norm at most η. We consider all possible perturbations because of the complex nature of the training dynamics. As a result of this worst-casing, the scaling we will require of the depth and width of the network in order to guarantee that the changes during training are sufficiently small is expected to be suboptimal. For a given general support pattern J, we define generalized backward features and transfer matrices β^ℓ_{J,t} = (W^{L+1}_t P_{J_L} W^L_t · · · W^{ℓ+2}_t P_{J_{ℓ+1}})^*, Γ^{ℓ:ℓ'}_{J,t} = W^ℓ_t P_{J_{ℓ−1}} W^{ℓ−1}_t · · · P_{J_{ℓ'}} W^{ℓ'}_t, (F.7) where the weights are given by (F.1) (and thus β^ℓ_t(x) = β^ℓ_{I_t(x),t}). By controlling these objects for every possible set of supports J that can be encountered during training, we can control the smooth changes in the features themselves. A first step towards this end is understanding how many such support patterns we expect to see given the constraint (F.2). In order to bound the number of supports that can be encountered during training, we need to control the diameter of B(ρ^ℓ_0(x), η). This can be done by defining δ_η(ρ^ℓ_0(x)) = max_{J ∈ B(ρ^ℓ_0(x), η)} |J △ I_{ℓ,0}(x)|, where △ denotes the symmetric difference.
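The diameter δ_η admits a simple combinatorial description, which the proof of Lemma F.1 exploits: a perturbation v with ||v||₂ ≤ η can flip the signs of precisely a set of coordinates whose squared magnitudes sum to at most η², so the maximal symmetric difference is found by flipping the smallest-magnitude entries first. A sketch computing δ_η under this interpretation for a Gaussian vector with the lemma's scaling (the size n = 1000 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
g = rng.standard_normal(n) / np.sqrt(n)   # g_i ~ N(0, 1/n), as in Lemma F.1

def delta(y, eta):
    # Largest number of sign flips achievable by a perturbation of 2-norm <= eta:
    # greedily flip the smallest-magnitude coordinates.
    cum = np.cumsum(np.sort(y ** 2))
    return int(np.searchsorted(cum, eta ** 2, side="right"))

etas = [0.01, 0.05, 0.1]
ds = [delta(g, e) for e in etas]
```

For small η the result scales like n·η^{2/3}, matching the bound in Lemma F.1, since the k-th smallest magnitude of such a Gaussian vector grows roughly linearly in k/n^{3/2}.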
Since the pre-activation at a given layer are Gaussian variables conditioned on all the previous layer weights, bounding the size of δ η (ρ 0 (x)) can be reduced to showing concentration of a certain function of Gaussian order statistics. This is achieved in the following lemma : Lemma F.1. For η given by (F.2), if n, L, d satisfy the requirements of lemma F.6 and n > d 5 for some constant K, then for a vector g ∈ R n , g i ∼ iid N (0, 1 n ) we have P δ η (g) > Cnη 2/3 ≤ C e -cd for some constants c, C, C . g Ji e Ji . Clearly for any y such that d s (g, y ) = δ η we have gy 2 ≤ gy 2 . Since there exists at least one such y ∈ B E (g, η), it follows that gy 2 ≤ gy 2 ≤ η, and hence the smallest k such that k i=1 g (i) 2 ≥ η 2 must obey k ≥ δ η . For η defined in (F.2), we can satisfy the requirement on η in lemma F.6 by requiring n > d 5 for some K. Applying this lemma, we find that there is a constant K such that for k = K nη 2/3 we have P k i=1 |g| 2 (i) ≥ η 2 ≥ 1 -C e -cd from which it follows immediately that P k η ≤ K nη 2/3 ≥ 1 -C e -cd . Choosing some constant C such that K nη 2/3 ≤ Cnη 2/3 and using (F.9) allows us to bound δ η (g) with the same probability. With this result in hand, we can control the objects in (F.7) for all the supports in J η (M). Lemma F.2. Assume d, L, n satisfy the assumptions of lemmas D.2, F.1, D.8, D.14 and additionally n ≥ max KdL 9+2q , K (log n) 3/2 , C 3 0 C 2 η L 6+2q , for some constants K, K , C 0 , where q is the constant in (F.2). Then i) for η, J η (x) defined in (F.2),(F.5), on an event of probability at least 1 -e -cd , simultaneously sup x∈M sup J ∈J η (x) sup ∈[L] 1≤ < ρ 0 (x) 2 ≤ C 2 , sup x∈M sup J ∈J η (x) sup ∈[L] 1≤ < β -1 J ,0 2 ≤ C 2 √ n, sup x∈M sup J ∈J η (x) sup ∈[L] 1≤ < Γ : J ,0 ≤ C 2 √ L. ii) for T η defined in (F.3), on an event of probability at least 1 -e -cd , for some constants c, C. Proof. Deferred to F.5.

F.3 CHANGES IN FEATURES DURING TRAINING

We can now bound the smooth changes during training: Lemma F.3 (Smooth changes during training). Set the step size τ and a bound on the maximal number of iterations k max such that k max τ = L q n for some constant q. Assume n, L, d satisfy the requirements of lemmas F.2, and in particular n ≥ KdL 9+2q for some K. Assume also that given some k ≤ k max -1, for all k ∈ {0, . . . , k}, Proof. We will bound the smooth changes in the network features during gradient descent with respect to either the population measure µ ∞ or the finite sample measure µ N . We denote a measure that can be one of these two by µ N . For any collection of supports J ∈ J η (M), define generalized backward features and transfer matrices at t by β J t , Γ : J t . These are obtained by setting the network parameters to be θ N t according to (F.1), but setting all the support patterns to be those in J . We then define for any t ∈ [0, k + 1], In order to control these, we also note that for any t ∈ [0, k + 1], α t (x) 2 ≤ ρ t (x) 2 ≤ ρ t (x) -ρ 0 (x) 2 + ρ 0 (x) 2 ≤ ρ η t , (F.14) and similarly β J t 2 ≤β η t , Γ : J t ≤Γ η t . (F.15) In particular, the above bounds hold when J = I t (x). We would now like to understand how the quantities ρ η k , β In order to control these quantities at t = 0 we define an event (F.21) the probability of which can be controlled using lemma F.2. On G, the upper bound on τ t in (F.20) is at least 1 ρ t (x) -ρ 0 (x) 2 ≤ t -1 k =0 ρ k +1 (x) -ρ k (x) 2 + ρ t (x) -ρ t (x) 2 = t -1 k =0 1 0 ∂ ∂s ρ k G = x∈M, J ∈J η (x), 1≤ ≤ ∈[L] ρ 0 (x) 2 ≤ C 2 ∩ β -1 J ,0 2 ≤ C 2 √ n ∩ Γ : J ,0 ≤ C 2 √ L , C √ dLρ η 0 β η 0 Γ η 0 ≥ 1 C √ dL 3/2 √ n (F.22) for some C . We would now like to pick τ k max , and ensure that any t ∈ [0, k + 1] satisfies the constraint above if k + 1 ≤ k max . The analysis also assumes that τ k max ≤ T η for which (F.3) holds. We will then pick the scaling factor C η for the pre-activation norm bound in (F.2) in order to satisfy that constraint as well. 
We choose τ k max = L q n . (F.23) In order to ensure that k max ≤ k * holds, we use (F.20) and (F.22), and require L q n ≤ 1 C √ dL 3/2 √ n which is satisfied by demanding n ≥ KdL 3+2q for some constant K. Using (F.17  ρ t (x) -ρ 0 (x) 2 ≤ ρ η t -ρ η 0 ≤ C τ t √ dL (ρ η 0 ) 2 β η 0 Γ η 0 ≤ C τ t √ dL 3/2 √ n ≤ C τ k max √ dL 3/2 √ n. In order to ensure that k max ≤ k η we therefore require We have thus ensured that our choice of k max satisfies k max ≤ min {k * , k η }. We then obtain from the constraints in (F.19), the inequalities in (F.18) and the definition of G, that on this event sup k ∈{0,...,k+1} C τ k max √ dL 3/2 √ n ≤ η, ρ η k -ρ η 0 ≤ C τ k max √ dL 3/2 √ n ≤ C √ dL 3/2+q √ n , sup k ∈{0,...,k+1} β η k -β η 0 ≤ C τ k max √ dL 3/2 n ≤ C √ dL 3/2+q . Then using (F.12) and (F.13), we obtain on an event of probability at least 1 -e -cd simultaneously Lemma F.5 (Uniform control of changes in Θ during training). Denoting the gradient descent step size by τ , choose some k max such that k max τ = L q n for some constant q. Assume also that given some k ≤ k max -1, for all k ∈ {0, . . . , k}, with the convention β L t (x) = 1 for all t, x, and the parameters θ N t given by (F.1). We thus have Θ N k (x, x ) -Θ(x, x ) ≤ L =0 1 s=0 α k+s (x), α k (x ) β k+s (x), β k (x ) -α 0 (x), α 0 (x ) β 0 (x), β 0 (x ) ds. We consider a single summand in the above expression. On the event defined in lemma F.4, for all x, x ∈ M, ∈ {0, . . . , L}, ≤C log 3/4 (L)d 3/4 L 3+2q/3 n 11/12 , for some constants. Summing this bound over gives the desired result.
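The object controlled by Lemma F.5 — the change in the tangent kernel during training — can be observed directly in a small model. The sketch below computes the empirical kernel Θ(x, x′) = ⟨∇_θ f(x), ∇_θ f(x′)⟩ of a one-hidden-layer ReLU network before and after a few gradient steps; this is a toy analogue of the deep setting, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n0, n, N = 5, 500, 10
X = rng.standard_normal((N, n0))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])

W1 = rng.standard_normal((n, n0)) * np.sqrt(2.0 / n0)
w2 = rng.standard_normal(n) * np.sqrt(2.0 / n)

def ntk(W1, w2):
    # Theta(x, x') = <h(x), h(x')> + <x, x'> * sum_j w2_j^2 1[rho_j(x)>0] 1[rho_j(x')>0]
    # for f(x) = w2 . relu(W1 x); this is the exact parameter-gradient inner product.
    rho = X @ W1.T
    h = np.maximum(rho, 0.0)
    act = (rho > 0).astype(float)
    return h @ h.T + (X @ X.T) * ((act * w2 ** 2) @ act.T)

theta0 = ntk(W1, w2)
tau = 1e-3
for _ in range(100):
    rho = X @ W1.T
    h = np.maximum(rho, 0.0)
    err = h @ w2 - y
    W1 -= tau * ((err[:, None] * w2) * (rho > 0)).T @ X / N
    w2 -= tau * (h.T @ err) / N

change = np.abs(ntk(W1, w2) - theta0).max()
```

With small steps and moderate width, `change` stays small relative to the diagonal of `theta0`, in the spirit of the uniform control established here.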

F.5 AUXILIARY LEMMAS AND PROOFS

Lemma F.6. Consider a collection of n i.i.d. variables X i = g 2 i , g i ∼ N (0, 1 n ) and denote the order statistics by X (i) (so that X (1) ≤ X (2) . . . ). For any d ≥ K log n, n ≥ K d 3 , η > C d 9/8 n 3/4 and integer Knη 2/3 ≤ k ≤ n, where K, K , K are appropriately chosen absolute constants, we have P k i=1 X (i) ≥ η 2 ≥ 1 -C e -cd , where c, C, C are absolute constants. Proof. We will relate sums of order statistics of X i to functions of uniform order statistics and show that these concentrate. We denote the CDF of the X i and its inverse by F and F -1 respectively. We use (X (1) , ..., X (k) ) d = (F -1 (U (1) ), ..., F -1 (U (k) )) where U (i) are order statistics with respect to Unif(0, 1) (David, 2011). Since X i ∼ 1 n Y i , Y i ∼ χ 2 1 we have F (x) = erf( nx 2 ) F -1 (t) = 2 n (erf -1 (t)) 2 ≥ c 0 n t 2 where in the inequality we used the series representation of erf -1 . This gives k i=1 X (i) d = k i=1 F -1 (U (i) ) ≥ c 0 n k i=1 U 2 (i) . The joint PDF of the first k order statistics for any distribution admitting a density is given by f (1)...(k) (x 1 , ..., x k ) = n! (n -k)! (1 -F (x k )) n-k k i=1 f (x i ) where x 1 ≤ x 2 ≤ ... ≤ x k (David, 2011) . Applying this to the uniform order statistics, we can compute the mean of the summands EU 2 (i) = n! (n -k)! u2 0 ... u k 0 1 0 u 2 i (1 -u k ) n-k du k ...du 1 = n!i(i + 1) (n -k)!(k + 1)! 1 0 u k+1 k (1 -u k ) n-k du k = i(i + 1) (n + 2)(n + 1) E k i=1 U 2 (i) = k(k + 1)(k + 2) 3(n + 2)(n + 1) ≥ c 1 k 3 n 2 . In order to show concentration, we appeal to the Rényi representation of order statistics (Boucheron et al., 2012) . This allows us to write k i=1 U 2 (i) as a Lipschitz function of independent exponential random variables, and we can then apply standard concentration results for such functions (Talagrand, 1995) . 
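The uniform order-statistic moment formulas just derived, E[U²_(i)] = i(i+1)/((n+1)(n+2)) and E[Σ_{i≤k} U²_(i)] = k(k+1)(k+2)/(3(n+1)(n+2)), can be checked by simulation (the choices n = 10, k = 3 and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, trials = 10, 3, 200000
U = np.sort(rng.uniform(size=(trials, n)), axis=1)   # rows of uniform order statistics

emp = (U[:, :k] ** 2).sum(axis=1).mean()
theory = k * (k + 1) * (k + 2) / (3.0 * (n + 1) * (n + 2))
```

The empirical mean matches the closed form to Monte Carlo accuracy.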
This representation is due to a useful property unique to the exponential distribution whereby the differences between order statistics are independent exponentially distributed variables themselves when properly normalized. If we define by E i , ..., E n a collection of independent standard exponential variables, the Rényi representation of the uniform order statistics gives (U (1) , ..., U (k) ) d = (1 -exp - E 1 n , ..., 1 -exp   - k j=1 E j n -j + 1   ). We now truncate the (E 1 , ..., E k ), so that w.p. P ≥ 1 -ke -K we have ∀i : E i ∈ [0, K], and denote this event by E K . Using K < n 2 , it is evident that (i) written in terms of Ẽi is now 1-Lipschitz and convex, and we can apply Talagrand's concentration inequality (Talagrand, 1995) to obtain P k i=1 1 E K U 2 (i) -E1 E K k i=1 U 2 (i) ≥ tKλ ≤ C exp -ct 2 . Setting t = c1k 3 2n 2 Kλ = c1k 2 (n-k) 8n 2 K , if we now assume c 2 nη 2/3 ≤ k ≤ c n for some c < 1 we obtain P k i=1 1 E K U 2 (i) -E k i=1 1 E K U 2 (i) ≥ c 1 k 3 2n 2 ≤ C exp - c k 4 n 2 K 2 ≤ C exp - c n 2 η 8/3 K 2 . We would also like to ensure that the truncation does not cause a large deviation in the mean. We have E {U(i)} k i=1 U 2 (i) -1 E K k i=1 U 2 (i) = E {Ei} k i=1   1 -exp   - i j=1 E j n -i + 1     2 -1 E K k i=1   1 -exp   - i j=1 E j n -i + 1     2 ≤ l m=1 E {Ei} 1 Em>k k i=1   1 -exp   - i j=1 E j n -i + 1     2 ≤ k l m=1 E {Ei} 1 Em>K = k 2 E {Ei} 1 E1>K = k 2 e -K . Since we would like this to be small compared to E k i=1 U 2 (i) ≥ c1k 3 n 2 we can require K > log 4n 2 c1k which gives E {U(i)} k i=1 U 2 (i) -1 E K k i=1 U 2 (i) < c1k 3 4n 2 . We can then choose the constant c 2 such that with probability P ≥ 1 -exp (log k -K) -C exp -cn 2 η 8/3 K 2 k i=1 X (i) ≥ c 0 n 1 E K k i=1 U 2 (i) ≥ c 0 n E1 E K k i=1 U 2 (i) - c 1 k 3 2n 2 = c 0 n E k i=1 U 2 (i) - c 1 k 3 2n 2 + E1 E K k i=1 U 2 (i) -E k i=1 U 2 (i) ≥ c 0 c 1 k 3 4n 2 ≥ η 2 . 
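The Rényi representation used above can be verified empirically: building U_(i) = 1 − exp(−Σ_{j≤i} E_j/(n − j + 1)) from independent standard exponentials reproduces the joint law of sorted uniforms. A sketch comparing first moments against E[U_(i)] = i/(n+1) (sample sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 8, 200000
E = rng.exponential(size=(trials, n))
# Renyi construction: normalized exponential spacings accumulate into order statistics
scaled = E / (n - np.arange(n))            # column j uses denominator n - j + 1 (1-indexed)
U_renyi = 1.0 - np.exp(-np.cumsum(scaled, axis=1))
U_direct = np.sort(rng.uniform(size=(trials, n)), axis=1)

theory = np.arange(1, n + 1) / (n + 1.0)
mean_r = U_renyi.mean(axis=0)
mean_d = U_direct.mean(axis=0)
```

Each row of `U_renyi` is sorted by construction, and both constructions match the exact means to Monte Carlo accuracy.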
The upper bound on k can then be removed since the inequality then applies to all larger k automatically. If we now set η according to equation (F.2), and choose K = d ≥ K log n and n satisfying n ≥ K d 3 for appropriate constants K , K we obtain where W 0 (:,1) is the first column of W 0 . Bernstein's inequality then gives P k i=1 X (i) ≥ η 2 ≥ 1 -exp (log k -d) -C exp - cC 8/3 η L 4+8q/3 n 2/3 d 2 ≥ 1 -C e -c P ρ 0 (x) 2 2 ≤ C α -1 0 (x) 2 2 ≥ 1 -C e -cn for appropriate constants. As discussed in Lemma D.8, if we choose d to satisfy the requirements of this lemma then N n -3 n -1/2 0 ≤ e C d for some constant. We can then uniformize over the net using a union bound, obtaining P ∀x ∈ N n -3 n -1/2 0 : ρ 0 (x) 2 2 ≤ C α -1 0 (x) 2 2 ≥ 1 -C e C d-cn ≥ 1 -C e -c n for some c , assuming n ≥ Kd. We now need to control the feature norms and pre-activation norms off of the net. From (D.62) and lemma G.10 we obtain that for d satisfying the requirements of lemma D.9, P   ∀x ∈ M, ∈ [L] : ∃x ∈ N n -3 n -1/2 0 ∩ N n -3 n -1/2 0 (x) s.t. ρ 0 (x) -ρ 0 (x) 2 ≤ Cn -5/2 ∩ α -1 0 (x) 2 -1 ≤ 1 2   ≥ 1 -e -cd . By taking a union bound over the above two results, we obtain P ∀x ∈ M, ∈ [L] : ρ 0 (x) 2 ≤ C ≥ 1 -C e -c d (F.25) for some constants We next turn to controlling the generalized backward features and transfer matrices. Our first task is to bound the number of support patterns that can be encountered, namely J η (M) . In order to do this it will be convenient to introduce a set that contains J η (M) and is easier to reason about. We define a metric between supports by (F.26) We will aim to control the volume of this set, which we will achieve by controlling it first on a net. This will require transferring control between different nearby points. R x, Cn -3 , with R x, Cn -3 denoting the number of risky features as defined in section D.3.1. We denote this event by E ρ . 
On E ρ , we can transfer control of the ball of feature supports from a point on the net to any point on the manifold. For some , x we denote by x the point on the net that satisfies the above condition. Considering (F.28), we choose g = ρ 0 (x), p = ρ 0 (x), r = Cn -5/2 and η = C η L 3/2+q n -1/2 , obtaining B s (supp(ρ 0 (x) > 0), δ η (ρ 0 (x))) ⊆ B s (supp(ρ 0 (x) > 0), δ η+Cn -5/2 (ρ 0 (x)) + 2d s (ρ 0 (x), ρ 0 (x))) ⊆ B s (supp(ρ 0 (x) > 0), δ 2η (ρ 0 (x)) + 2d), (F.30) where we assumed C η L 3/2+q n 2 > C. We next turn to controlling δ 2η (ρ 0 (x)), which is now a random variable, for all ∈ [L], x ∈ N n -3 n -1/2 0 . From lemma F.1 we have for a vector g with g i ∼ iid N (0, 1 n ), P δ 2η (g) ≥ C 0 nη 2/3 ≤ C e -cd . Considering a vector ρ 0 (x) for some ∈ [L], x ∈ N n -3 n -1/2 0 , we have ρ 0 (x) ∼ N (0, 2 α -1 0 (x) where we used (F.30). On E ρ ∩ E N δ , we can thus bound the size of the set that contains J η , denoting its size by S η = Vol x∈M L ⊗ =1 B s (sign(ρ 0 (x)), δ η (ρ 0 (x))). We first note that for any p, VolB s (p, C 1 nη 2/3 ) ≤ C1nη 2/3 i=0 n i ≤ C C 1 nη 2/3 n C1nη 2/3 ≤ C n C C1nη 2/3 for appropriate constants, assuming nη 2/3 > K log(nη 2/3 ) for some K. It follows that on E ρ ∩ E N δ , S η = Vol for some constants. We would next like to employ lemma D.14 in order to control the quantities of interest for a single J ∈ J η (M), and then take a union bound utilizing the upper bound above on J η . This will require controlling the event E δK in the lemma statement with an appropriate choice of the constants δ s , K s . As in other sections, we use the convention Γ : +1 J 0 = I for any ∈ [L]. At a given collection of supports J ∈ J η (x) for some x ∈ M, we choose x as the anchor point in lemma D.14. From (F.27) we have, for any x ∈ N n -3 n -1/2 0 , δ η (ρ 0 (x)) ≤ d s ρ 0 (x), ρ 0 (x) + δ η+ ρ 0 (x)-ρ 0 (x) 2 (ρ 0 (x)). 
Then using (F.29) we obtain to bound the two terms in the RHS gives P ∀x ∈ M, ∈ [L] : ∃x ∈ N n -3 n -1/2 0 ∩ N n -3 n -1/2 0 (x) s.t. δ η (ρ 0 (x)) ≤ d + δ 2η (ρ 0 (x)) ≥ 1 -6e -d/2 . where we used η = C η L 3/2+q n -1/2 and d satisfies the requirements of lemma D.8. Using (F.33) to bound d + δ 2η (ρ 0 (x)) uniformly on N n -3 n -1/2 0 and , and combining the failure probabilities of these events by a union bound, we obtain P ∃x ∈ M s.t. δ η (ρ 0 (x)) > C 1 nη 2/3 ≤ 6e -d/2 + Ce -cd ≤ C e -c d for some constants. Since δ η (ρ 0 (x)) ≥ |J I ,0 (x)| for any J ∈ J ∈ J η (x), implies directly that P ∀x ∈ M, J ∈ J ∈ J η (x) : |J I ,0 (x)| ≤ C 1 nη 2/3 ≥ 1 -C e -c d . (F.35) In the notation of lemma D.14 we denote this event by E δ , and choose δ s = C 1 nη 2/3 . From the definition of J η , for every x ∈ M and J that is an element of J ∈ J η (x), J = supp ρ 0 (x) + v > 0 for some v such that v 2 ≤ η. We now consider the vector w = P J -P I (x) ρ 0 (x). Note that for any element of w i that is non-zero, we must have |v i | ≥ ρ 0 (x) i (since the perturbation must change the sign of this element), in which case we have |w i | = ρ 0 (x) i . Denoting the set of indices of these non-zero elements by Q, we have w 2 2 = i∈Q w 2 i = i∈Q ρ 0 (x) i 2 ≤ i∈Q v 2 i ≤ v 2 2 ≤ η 2 . This holds for all ∈ [L]. Thus if we set K s = η for K s , the event E K in lemma D.14 holds with probability 1. We therefore choose E δK = E δ with E δ defined in (F.35). In order to apply D.14 we must also ensure δ s = C 0 nη 2/3 ≤ n L , K s = η ≤ 1 2 L -3/2 . Setting η = CηL 3/2+q √ n as per (F.2), we can satisfy these requirements by demanding n ≥ C 3 0 C 2 η L 6+2q . We are now in a position to apply lemma D.14 to control the objects of interest. We use rotational invariance of the Gaussian distribution repeatedly to obtain 1 E δK β J ,0 2 =1 E δK W L+1 0 Γ L: +2 J 0 P J +1 2 ≤ a.s. 1 E δK W L+1 0 Γ L: +2 J 0 2 d =1 E δK Γ L: +2 J 0 W L+1 * 0 2 d = 1 E δK Γ L: +2 J 0 e 1 2 W L+1 * 0 2 . 
Recalling that W L+1 i ∼ N (0, 1), we use Bernstein's inequality to obtain P W L+1 2 > C √ n ≤ C e -cn , and another application of lemma D.14 gives P 1 E δK Γ L: +2 J 0 e 1 2 > C ≤ C e -c n L for some constants. Hence after worsening constants P 1 E δK β J ,0 2 > C √ n ≤ C e -c n L . (F.36) We also obtain 1 E δK Γ : J ,0 = 1 E δK W 0 Γ -1: -1 J ,0 P J ≤ a.s. 1 E δK W 0 Γ -1: -1 J ,0 P 1 E δK Γ : J ,0 > C √ L ≤ C e -cn + C e -c n L ≤ C e -c n L where we used an ε-net argument to bound W 0 and lemma D.14 to bound E δK Γ -1: -1 J ,0 . We now combine this result with (F.36). It remains to uniformize this result over the choice of and J . Combining (F.26) and (F.34) gives P J η (M) > C e C dLnη 2/3 ≤ C e -cd . (F.37) Denoting the complement of this event by E J , and setting η = CηL 3/2+q √ n , on this event we have P   ∀J ∈ J η (M), < ∈ [L], : ii) We will control β I0(x),0 -β It(x),0 2 using lemma D.21. For t ∈ [0, T η ] we note that by definition of T η and J η (x), I t (x) ∈ J η (x). 1 E δK Γ : J ,0 ≤ C √ L ∩ 1 E δK β J ,0 2 ≤ C √ n E J   ≥ 1 -C e C dLnη 2/3 -c n L ≥ 1 -C e C C As noted in the previous section, if we set d to satisfy lemma D.8 and n ≥ KdL 9+2q for some K, then the requirements of lemma D.14 are satisfied with δ s = C 0 nη 2/3 , K s = η. Lemma G.4 (Hanson-Wright Inequality (Vershynin, 2018, Theorem 6.2.1) ). Let g be a vector of n i.i.d., mean zero, sub-Gaussian variables and A be an n × n matrix. Then for any t > 0, we have P [|g * Ag -Eg * Ag| ≥ t] ≤ 2 exp -c min t 2 K 4 A 2 F , t K 2 A where max i g i ψ2 ≤ K (with • ψ2 denoting the sub-Gaussian norm). Lemma G.5 (Freedman's Inequality (Freedman, 1975, Theorem 1.6 )). Let (∆ i , F i ) be a sequence of martingale differences, with E ∆ i F i-1 = 0, and suppose that |∆ i | ≤ R a.s.. Define the quadratic variation V L = L i=1 E ∆ i 2 F i-1 . Then P ∃i = 1 . . . L s.t. i =1 ∆ > t and V i ≤ σ 2 ≤ 2 exp - t 2 /2 σ 2 + Rt/3 . Lemma G.6 (Moment control Freedman's (de la Peña, 1999)). 
Let $(\Delta_i, \mathcal{F}_i)$ be a sequence of martingale differences, with $\mathbb{E}[\Delta_i \mid \mathcal{F}_{i-1}] = 0$, and suppose that

$$\mathbb{E}\big[|\Delta_i|^k \mid \mathcal{F}_{i-1}\big] \le \frac{k!}{2}\, \mathbb{E}\big[\Delta_i^2 \mid \mathcal{F}_{i-1}\big]\, R^{k-2} \qquad \forall k,\ \text{a.s.}$$

Set $V_j = \sum_{i=1}^{j} \mathbb{E}[\Delta_i^2 \mid \mathcal{F}_{i-1}]$. Then

$$\mathbb{P}\Big[\exists i \in \{1, \dots, j\} \text{ s.t. } \sum_{i'=1}^{i} \Delta_{i'} > t \text{ and } V_i \le \sigma^2\Big] \le 2 \exp\Big(-\frac{t^2/2}{\sigma^2 + Rt}\Big).$$

Lemma G.7 (Martingales with subgaussian increments). Let $(\Delta_i, \mathcal{F}_i)$ be a sequence of martingale differences, and suppose that

$$\mathbb{E}\big[\exp(\lambda \Delta_i) \mid \mathcal{F}_{i-1}\big] \le \exp\Big(\frac{\lambda^2 V^2}{2}\Big) \qquad \forall \lambda,\ \text{a.s.}$$

Then

$$\mathbb{P}\Big[\Big|\sum_{i=1}^{L} \Delta_i\Big| > t\Big] \le 2 \exp\Big(-\frac{t^2}{2LV^2}\Big).$$

Proof. By assumption, $\mathbb{E}[\Delta_i] = 0$ for each $i \in [L]$. We calculate using standard properties of the conditional expectation

$$\mathbb{E}\Big[e^{\lambda \sum_{i=1}^{L} \Delta_i}\Big] = \mathbb{E}\Big[\mathbb{E}\Big[e^{\lambda \sum_{i=1}^{L} \Delta_i} \,\Big|\, \mathcal{F}_{L-1}\Big]\Big] = \mathbb{E}\Big[e^{\lambda \sum_{i=1}^{L-1} \Delta_i}\, \mathbb{E}\big[e^{\lambda \Delta_L} \mid \mathcal{F}_{L-1}\big]\Big] \le e^{\lambda^2 V^2/2}\, \mathbb{E}\Big[e^{\lambda \sum_{i=1}^{L-1} \Delta_i}\Big].$$

Moreover, one has $\mathbb{E}[e^{\lambda \Delta_1} \mid \mathcal{F}_0] = \mathbb{E}[e^{\lambda \Delta_1}] \le e^{\lambda^2 V^2/2}$. An induction therefore implies $\mathbb{E}[e^{\lambda \sum_{i=1}^{L} \Delta_i}] \le e^{\lambda^2 L V^2/2}$, and the result follows from standard equivalence properties of subgaussian random variables (Vershynin, 2018, Proposition 2.5.2).

Lemma G.8 (Azuma–Hoeffding Inequality (Azuma, 1967)). Let $(\Delta_i, \mathcal{F}_i)$ be a sequence of martingale differences, and suppose that $|\Delta_i| \le R_i$ a.s. Then

$$\mathbb{P}\Big[\Big|\sum_{i=1}^{L} \Delta_i\Big| > t\Big] \le 2 \exp\Big(-\frac{t^2}{2 \sum_{i=1}^{L} R_i^2}\Big).$$

Lemma G.9 (Chi and Inverse-Chi Expectations). Let $X \sim \chi(n)$ be a chi random variable with $n$ degrees of freedom, equal to the square root of the sum of $n$ independent and identically distributed squared $\mathcal{N}(0, 1)$ random variables. Then

$$\mathbb{E}[X] = \sqrt{2}\, \frac{\Gamma(\frac{1}{2}(n+1))}{\Gamma(\frac{1}{2}n)}, \qquad \text{and, if } n \ge 2, \qquad \mathbb{E}[X^{-1}] = \frac{1}{\sqrt{2}}\, \frac{\Gamma(\frac{1}{2}(n-1))}{\Gamma(\frac{1}{2}n)}.$$

Proof. We use the fact that the density of $X$ is given by

$$\rho(x) = \mathbb{1}_{x \ge 0}(x)\, \frac{1}{2^{n/2-1} \Gamma(\frac{1}{2}n)}\, x^{n-1} e^{-x^2/2},$$

which can be proved easily using the Gaussian law and a transformation to spherical polar coordinates (Muirhead, 1982, Theorem 2.1.3). The expectation of $X$ then results from a simple sequence of calculations using the change of variables $u = x^2/2$:

$$\mathbb{E}[X] = \frac{1}{2^{n/2-1} \Gamma(\frac{1}{2}n)} \int_0^\infty x^n e^{-x^2/2}\, dx = \frac{2^{(n-1)/2}}{2^{n/2-1} \Gamma(\frac{1}{2}n)} \int_0^\infty u^{\frac{1}{2}(n+1)-1} e^{-u}\, du = \sqrt{2}\, \frac{\Gamma(\frac{1}{2}(n+1))}{\Gamma(\frac{1}{2}n)}.$$

Now we study $X^{-1}$.
By the change of variables formula, the density of $X^{-1}$ is given by

$$\rho'(x) = \mathbb{1}_{x > 0}(x)\, \frac{1}{2^{n/2-1} \Gamma(\frac{1}{2}n)}\, x^{-(n+1)} e^{-1/(2x^2)}.$$

A similar sequence of calculations (substituting $u = 1/(2x^2)$) then yields

$$\mathbb{E}[X^{-1}] = \frac{1}{2^{n/2-1} \Gamma(\frac{1}{2}n)} \int_0^\infty x^{-n} e^{-1/(2x^2)}\, dx = \frac{2^{(n-3)/2}}{2^{n/2-1} \Gamma(\frac{1}{2}n)} \int_0^\infty u^{\frac{1}{2}(n-1)-1} e^{-u}\, du = \frac{1}{\sqrt{2}}\, \frac{\Gamma(\frac{1}{2}(n-1))}{\Gamma(\frac{1}{2}n)},$$

provided $n > 1$.

Lemma G.10 (Equivalence of $\ell^p$ Norms). Let $1 \le p \le q \le +\infty$. Then for every $x \in \mathbb{R}^n$ one has $\|x\|_q \le \|x\|_p \le n^{1/p - 1/q} \|x\|_q$.

Lemma G.11 (Gaussian Moments). Let $p \ge 1$, and let $g \sim \mathcal{N}(0, 1)$ be a standard normal random variable. Then

$$\mathbb{E}[|g|^p] = \frac{2^{p/2}\, \Gamma\big(\frac{p+1}{2}\big)}{\Gamma\big(\frac{1}{2}\big)}; \qquad \mathbb{E}\big[[g]_+^p\big] = \frac{1}{2}\, \mathbb{E}[|g|^p],$$

where $[x]_+ = \max\{x, 0\}$. In particular $\mathbb{E}[|g|^p] \le p^{p/2}$, so that $g$ is subgaussian and $g^2$ is subexponential.
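The closed forms in Lemmas G.9 and G.11 are easy to sanity-check numerically. The sketch below is illustrative only (function names are ours, not the paper's): it integrates the chi density with a simple stdlib quadrature and compares against the Gamma-ratio formulas, then checks the Gaussian-moment formula at $p = 2, 4$.

```python
import math

def chi_density(x, n):
    # density of a chi random variable with n degrees of freedom (Lemma G.9)
    return x ** (n - 1) * math.exp(-x * x / 2) / (2 ** (n / 2 - 1) * math.gamma(n / 2))

def quad(f, a, b, steps=50_000):
    # midpoint rule; adequate for these smooth, rapidly decaying integrands
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def chi_mean(n):
    # E[X] = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2)
    return math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)

def chi_inv_mean(n):
    # E[1/X] = Gamma((n-1)/2) / (sqrt(2) * Gamma(n/2)), valid for n >= 2
    return math.gamma((n - 1) / 2) / (math.sqrt(2) * math.gamma(n / 2))

for n in (2, 3, 5, 10):
    assert math.isclose(quad(lambda x: x * chi_density(x, n), 0.0, 60.0),
                        chi_mean(n), rel_tol=1e-4)
    assert math.isclose(quad(lambda x: chi_density(x, n) / x, 0.0, 60.0),
                        chi_inv_mean(n), rel_tol=1e-4)

# Lemma G.11: E|g|^p = 2^{p/2} Gamma((p+1)/2) / Gamma(1/2); p = 2, 4 give 1 and 3
for p, expect in ((2, 1.0), (4, 3.0)):
    assert math.isclose(2 ** (p / 2) * math.gamma((p + 1) / 2) / math.gamma(0.5),
                        expect, rel_tol=1e-12)
```

For $n = 2$ the chi law is the Rayleigh distribution, so `chi_mean(2)` recovers the familiar value $\sqrt{\pi/2}$.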



When using data augmentation, for example, the number of samples is effectively infinite yet highly structured, enabling convergence and generalization. The closest result we are aware of is (Chen et al., 2021, Theorem 3.4); this result involves a priori assumptions on the trained network weights, which are only resolved for two-layer networks, and entails an unnatural relationship between $n$ and $N$ and a possible exponential dependence of $N$ on $L$, which Theorem 1 avoids. Since we do not use the "NTK parameterization", the norm of our NTK scales like $nL$ rather than $L$. Due to our scaling of the weights (Section 2.1), the contribution of the final layer to the NTK is negligible and can be dropped. This leads to discrepancies between the expression above and similar expressions found in the literature; we show essential equivalence between our NTK and others in Appendix A.3. Technically, the features $\alpha^\ell(x)$ depend on all the weights up to layer $\ell$, and hence so does the projection matrix $P_{I^\ell(x)}$, but our analysis shows that this dependence has only a minor effect on the statistical fluctuations. Certain parts of our argument, such as the concentration result Theorem B.2, are naturally applicable to cases where $M_\pm$ themselves have a finite number of connected components, with a mild dependence on this number, and we state them as such. We skip this extra generality in our dynamics arguments to avoid an additional 'juggling act' that would obscure the main ideas. We point out that the curvature of the manifolds does not enter into the proof of the concentration result Theorem B.2, so there is no ambiguity in discussing curvature only in the context of curves. The results of Arora et al. (2019b) apply to data of norm no larger than 1, but it is straightforward to extend our results for spherical data to this setting, using the 1-homogeneity of $\Theta$ in each argument (as a kernel on the entire ambient space $\mathbb{R}^{n_0} \times \mathbb{R}^{n_0}$) to write $\Theta(x, x') = \|x\|_2 \|x'\|_2\, \Theta(x/\|x\|_2,\, x'/\|x'\|_2)$.
Here we treat the empty sum as the appropriate 'zero element' of the space of finite signed Borel measures on $M_\pm$, namely the trivial measure that assigns zero to every Borel subset of $M_\pm$. The lower bound is implied by $\cosh x \ge 1$; the upper bound follows from writing $\sinh x = \frac{1}{2}e^x(1 - e^{-2x})$ and using $e^{-x} \ge 1 - x$. For any circle and choice of endpoints there will be two such arcs, and the last condition implies that we choose the shorter of the two. To see that this set is indeed an event, use that $\beta_{S,\Sigma,\Delta}(x)$ is a continuous function of the network weights except with respect to the support projections; but $x \mapsto \mathbb{1}_{x > 0}$ is increasing, hence Borel-measurable, and so the set consists of a finite union of Borel-measurable sets. $\rho^i(x) = W^i\alpha^{i-1}(x) \overset{d}{=} G^i\alpha^{i-1}(x)$, $\mathbb{P}[\|G^i\alpha^{i-1}(x)\|_2^2 > 2n] \le e^{-cn}$, and since $g \overset{d}{=} u\,\|g\|_2$
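The 1-homogeneous extension of $\Theta$ discussed in the footnotes above can be illustrated concretely. In this sketch, `sphere_kernel` is a hypothetical stand-in for the NTK restricted to the sphere (any kernel of the angle works); the point is only that the extension $\Theta(x, x') = \|x\|_2\|x'\|_2\,\Theta(x/\|x\|_2, x'/\|x'\|_2)$ is automatically 1-homogeneous in each argument.

```python
import math

def sphere_kernel(u, v):
    # hypothetical stand-in for the NTK restricted to the unit sphere:
    # any function of the angle between u and v will do for this check
    dot = sum(a * b for a, b in zip(u, v))
    return math.pi - math.acos(max(-1.0, min(1.0, dot)))

def norm2(x):
    return math.sqrt(sum(a * a for a in x))

def extended_kernel(x, y):
    # 1-homogeneous extension in each argument:
    # Theta(x, y) = ||x|| ||y|| Theta(x/||x||, y/||y||)
    nx, ny = norm2(x), norm2(y)
    return nx * ny * sphere_kernel([a / nx for a in x], [b / ny for b in y])

x, y = [0.3, 0.4, 0.5], [-0.1, 0.2, 0.9]
a, b = 2.5, 0.7
lhs = extended_kernel([a * c for c in x], [b * c for c in y])
# scaling either argument by a positive factor scales the kernel by that factor
assert math.isclose(lhs, a * b * extended_kernel(x, y), rel_tol=1e-12)
```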



Figure 3: The coaxial circles geometry.

(B.46) by worst-casing with the larger residual from the population error term in (B.51), and made other simplifications by worst-casing some constants. We simplify (B.38) next: we have shown that ζ N,Lip s ∈ Lip(M) and ζ N,Lip s ∈ L ∞ (M) above, and so for every x ∈ M, we have

(B.64); and by construction $\zeta^{N,\mathrm{Lip}}_0 = \zeta$, and (B.31) and $d \ge 1$ then give (B.65) if $L \ge e$. We therefore move to the induction step, assuming that (B.64) and (B.65) hold for $k - 1$ and showing that this implies the bounds for $k$. We begin by verifying (B.64). Applying the induction hypothesis for $k - 1$ via (B.65), we can write

Given k > 0 and integer d ≥ 2, suppose that M is a d-dimensional complete Riemannian manifold with Ricci curvature tensor satisfying Ric ≥ -(d -1)k. Then for any r, ε > 0 and any p ∈ M, there exists an ε-net (measured in the Riemannian distance dist M ) of the metric ball {x ∈ M | dist M (p, x) ≤ r} with cardinality at most (C M /ε) d , where C M > 0 is a constant depending only on k and r.

choose a subset $S \subset [n]$ with cardinality at most $d'$; using $n \ge e$ and $d' \ge 1$, we have

$$\sum_{k=0}^{d'} \binom{n}{k} \le 1 + d' n^{d'} \le 4 d' n^{2d'}.$$
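The counting bound above, at most $1 + d'n^{d'} \le 4d'n^{2d'}$ subsets of $[n]$ of cardinality at most $d'$, can be checked directly with `math.comb` (an illustrative check, not part of the proof):

```python
import math

def subset_count(n, d):
    # number of subsets of [n] of cardinality at most d
    return sum(math.comb(n, k) for k in range(d + 1))

for n in (3, 10, 50, 200):
    for d in range(1, min(n, 8) + 1):
        # each term with k >= 1 is at most n^k <= n^d, and there are d of them
        assert subset_count(n, d) <= 1 + d * n ** d <= 4 * d * n ** (2 * d)
```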

and $n \ge \max\{K d^3 L,\ K' d^4,\ K''\}$. If additionally $d \ge K d_0 \log(n n_0 C_M)$, we obtain by the discussion in Appendix D.3.1 and another union bound

i1,i2,...,i2r) . (D.85) Considering first the form of the terms in G r,p , using (D.83) and (D.81) and recalling that Γ m-1:m HJ = I, we have g r,p(i1,i2,...,i2r)

term in (D.85) using lemma D.18, setting p = f above and choosing d = n L gives

using (D.111) to bound the terms with r = 1, and setting d 0 = d 1 we obtain

1) u i+1:j+1 2 where u i+1:j+1 = P Ij+1 Γ i:j+2 * H H i+1 * (:,1) and s km ∈ {-1, 1} are the signs of the elements in Q m (x) for m ∈ {i, j}. In the above expression, k m index the entries on which diagQ m is supported, and we denote d m = |suppdiagQ m | and use the permutation symmetry of the Gaussian distribution to set these to be [d m ].

xx = P I (x) P I (x ) we have tr B : xx -E W tr B : xx = tr B -1:

.3.5 DIFFERENTIATION RESULTS Lemma E.21. For a < b, let f : [a, b] → R be a continuous function that is differentiable on (a, b) except at a set of isolated points in (a, b), and let c ∈ R. Then max{f, c} is differentiable except at a set of isolated points in (a, b). Proof. Let A ⊂ (a, b) denote the set of points of differentiability of f , and let B ⊂ (a, b) denote the set of points of nondifferentiability of max{f, c}. Because finite unions of isolated sets of points in (a, b) are isolated in (a, b), it suffices to consider only points x ∈ A.

The fact that $j_1 + \cdots + j_m = k$ implies that there are $k$ factors in the denominator, so we can put the factors in the numerator and denominator into one-to-one correspondence. Consider the ordering of the factors in the denominator $\big(\prod_{l=1}^{j_1}(2l-1)\big) \cdots \big(\prod_{l=1}^{j_m}(2l-1)\big)$. Then

$$\frac{\prod_{i=1}^{k}(2i-1)}{\prod_{i=1}^{j_1}(2i-1)} = \prod_{i=j_1+1}^{k}(2i-1).$$
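The cancellation above is just a telescoping product of odd factors; a quick illustrative check:

```python
import math

def odd_prod(a, b):
    # product of (2i - 1) over i = a, ..., b (empty product = 1)
    return math.prod(2 * i - 1 for i in range(a, b + 1))

for k in range(1, 12):
    for j1 in range(k + 1):
        # cancelling the first j1 odd factors leaves exactly the factors j1+1, ..., k
        assert odd_prod(1, k) // odd_prod(1, j1) == odd_prod(j1 + 1, k)
```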

$x)| \le 4^4 e^{-4} + 2 \cdot 3^3 e^{-3}$. Combining these bounds with our lower bound on $f(x) + f(c - x)$ and repeatedly applying the triangle inequality and modulus bounds in (E.50) and (E.51), then subsequently in (E.49) (using also $|x| \le c$), we conclude the claimed bounds on $|\dot\varphi_c|$ and $|\ddot\varphi_c|$.

this is immediate, since on the event E b we have | vν 2 2 -v 0 2 2 | ≤ 20. It thus follows from Lemma E.48 that with probability at least 1 -Ce -cn -C n -2d+1/2 we have ∀ν ∈ [0, π], v ν , vν ≤ C d log n n . (E.64)

• ); square-integrability of X and Y is evident from the definition of 1 E , and we have |X| ≤ 1 by Cauchy-Schwarz. To control Y , we start by noting

[σ(g 11 cos ν + g 21 sin ν)(g 21 cos ν -g 11 sin ν)] But we have using rotational invariance that E[σ(g 11 cos ν + g 21 sin ν)(g 21 cos ν -g 11 sin ν)] = 0, which impliesE v 0 , v ν 2 v ν , vν 4 ≤ C/n, from which we conclude for all ν |E[Ξ 6 (ν, g 1 , g 2 )]| ≤ C/ √ n.

ν -C 3 , c ≤ ν ≤ π/2.We define D = 20/(π/2 -c) 3 , A = 2C 2 /π + C 4 , B = 2C 3 /π + πC 4 /2 + C 1 , and C = C 3 , so that the RHS can be written as -D(π/2 -ν) 3 -Aν 2 + Bν -C. Differentiating once and equating to zero results in the quadratic equation 4N . Numerically estimating the constants, we get that the two roots lie in [0.99, 1] and [

$$\mathbb{P}\Big[\sup_{x \in S} f(x, \cdot) > t + L\varepsilon\Big] \le \mathbb{P}\Big[\sup_{x \in N_\varepsilon} f(x, g) > t\Big] + \mathbb{P}[E] \le \delta_t \Big(1 + \frac{2M}{\varepsilon}\Big)^d + \mathbb{P}[E], \tag{E.80}$$

as claimed.
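The deterministic mechanism behind this net argument is that an $L$-Lipschitz function's supremum over a set exceeds its supremum over an $\varepsilon$-net by at most $L\varepsilon$. The sketch below is illustrative only; the function `f`, the Lipschitz constant, and the net construction are hypothetical stand-ins, not objects from the proof.

```python
import math

def eps_net(eps):
    # points spaced 2*eps apart cover [0, 1] to within distance eps
    return [min(1.0, eps * (2 * k + 1)) for k in range(int(1 / (2 * eps)) + 1)]

f = lambda x: math.sin(7 * x) + 0.5 * math.cos(20 * x)  # Lipschitz constant <= 7 + 10 = 17
lip, eps = 17.0, 1e-3

net_sup = max(f(x) for x in eps_net(eps))
grid_sup = max(f(i / 100_000) for i in range(100_001))  # dense-grid stand-in for sup over [0, 1]

# the sup over the whole set exceeds the sup over the net by at most lip * eps
assert grid_sup <= net_sup + lip * eps
```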

sup t∈[0,kη],x∈M, ∈[L]

v 2 ≤η supp ρ 0 (x) > 0 supp ρ 0 (x) + v > 0 (F.8)

Proof. Let $|g|_{(1)} \le |g|_{(2)} \le \cdots \le |g|_{(n)}$ denote the order statistics of the magnitudes of the elements of $g$. We will show that bounding $\delta_\eta(g)$ can be reduced to determining the smallest $k$ such that $|g|_{(1)}^2 + \cdots + |g|_{(k)}^2 \ge \eta^2$; we denote this value of $k$ by $k_\eta$. Define indices $j_i$ by $|g_{j_i}| = |g|_{(i)}$ (breaking ties arbitrarily in case several order statistics are equal). To see that $k_\eta - 1 \le \delta_\eta(g) \le k_\eta$: since $|g|_{(1)}^2 + \cdots + |g|_{(k_\eta - 1)}^2 < \eta^2$, one can choose $\varepsilon > 0$ small enough such that $y = g - (1 + \varepsilon)\sum_{i=1}^{k_\eta - 1} g_{j_i} e_{j_i} \in B_E(g, \eta)$, which will give $d_s(g, y) = k_\eta - 1 \Rightarrow \delta_\eta(g) \ge k_\eta - 1$. To prove the second inequality, consider $y = g - \sum_{i=1}^{\delta_\eta}$
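The threshold index $k_\eta$ used in this proof, the smallest $k$ for which the $k$ smallest squared magnitudes of $g$ accumulate to at least $\eta^2$, is straightforward to compute. The helper below is an illustrative sketch; the function name and the convention for the exhausted case are ours, not the paper's.

```python
def k_eta(g, eta):
    """Smallest k such that the k smallest squared magnitudes of g sum to at
    least eta^2; returns len(g) + 1 if even the whole vector falls short."""
    total, target = 0.0, eta * eta
    for k, v in enumerate(sorted(abs(x) for x in g), start=1):
        total += v * v
        if total >= target:
            return k
    return len(g) + 1

g = [0.1, -0.2, 3.0, 0.05, -0.15]
# sorted squared magnitudes: 0.0025, 0.01, 0.0225, 0.04, 9.0
assert k_eta(g, 0.05) == 1   # 0.0025 already reaches 0.05^2
assert k_eta(g, 0.2) == 4    # 0.0025 + 0.01 + 0.0225 < 0.04; adding 0.04 suffices
```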

$\log^{3/4}(L)\, d^{3/4} L^{3+2q/3} n^{-5/12}$.

Then on an event of probability at least 1 -e -cd , one has simultaneouslysup x∈M, ∈[L], k ∈{0,...,k+1} ρ k (x) -ρ 0 (x) 2 ≤ C L 3/2+q d n , sup x∈M, ∈[L], k ∈{0,...,k+1} β -1 k (x) -β -1 0 (x) 2 -β -1 I k 0 (x) -β -1 I00 (x) 2 ≤ C √ dL 3/2+q ,for some constants c, C, C .

all k ≤ k + 1, ρ k (x) -ρ 0 (x) 2 ≤ ρ η k -ρ η 0 , (F.12) while β k (x) -β 0 (x) 2 = β I k (x),k -β I0(x),0 2 ≤ β I k (x),k -β I k (x),0 2 + β I k (x),0 -β I0(x)we can control the difference norms of the pre-activations and backward features by controlling the magnitudes of ρ η k , β η k .

descent. Towards this end, for any k ∈ {0, . . . , k} and s ∈ [0, 1] we compute at any point of differentiabilityα i-1 k +s (x), α i-1 k (x ) Γ :i+1 k +s (x)P I i,k +s (x) β i-1 k (x )ζ N k (x )dµ N (x ).where we used Jensen's inequality in the second line and our assumption that the error up to iteration k has bounded L 2 µ N norm, and we additionally assumed ρ η t Arguing as in the proof of Lemma B.8 for absolute continuity, it follows that

F.18) to satisfy the second and third condition in (F.19) gives an identical constraint on τ t.

) and (F.22), on G we have supt∈[0,k+1],x∈M, ∈[L]

constraint can be satisfied by choosing C η = C √ d. Note that the constant C 2 in (F.21) (which enters C ) is set in lemma F.2 which takes C η as input (despite this, C 2 is independent of C η ). This lemma holds as long as n ≥ C 3 0 C 2 η L 6+2q , which we can guarantee by demanding n ≥

sup x∈M, ∈[L], k ∈{0,...,k+1} ρ k (x) -ρ 0 (x) 2 ≤ C L 3The combination of the last two lemmas allows us to control the changes in all the forward and backward features uniformly: Lemma F.4. Assume n, L, d, k satisfy the requirements of lemmas F.2 and F.3, and additionally n ≥ KL 36+8q d 9 for some K. Then one has simultaneously on an event of probability at least1 -e -cd sup x∈M, t∈[0,k+1], ∈[L] α t (x) -α 0 (x) 2 ≤ CL 3-β -1 0 (x) 2 ≤ C log 3/4 (L)d 3/4 L 3+2q/3 n 5/12 , sup x∈M, t∈[0,k+1], ∈[L] α t (x) 2 ≤ C, sup x∈M, t∈[0,k+1], ∈[L]Combine the results of lemmas F.2 and F.3 and take a union bound, using the triangle inequality to obtain the second two bounds. The assumption n ≥ KL 36+8q d 9 is required in showingβ -1 t (x) 2 ≤ C √ n. F.4 CHANGES IN Θ N k DURING TRAININGWith these results in hand, control of the changes in Θ N k during training is straightforward.

x), α 0 (x ) β 0 (x), β 0 (x ) ,∆N k = sup (x,x )∈M×M, k ∈{0,...,k} Θ N k (x, x ) -Θ(x, x ) .Assume n ≥ KL 36+8q d 9 , d ≥ K d 0 log (nn 0 C M ) for constants K, K . Then on an event of probability at least 1 -e -cd ∆N k ≤ C log 3/4 (L)d 3/4 L 4+2q/3 n 11/12 for some constants c, C. k+s (x), α k (x ) β k+s (x), β k (x ) ds

k+s (x), α k (x ) β k+s (x), β k (x ) -α 0 (x), α 0 (x ) β 0 (x), β 0 (x ) k+s (x) -α 0 (x), α k (x ) β k+s (x), β k (x ) + α 0 (x), α k (x ) -α 0 (x ) β k+s (x), β k (x ) + α 0 (x), α 0 (x ) β k+s (x) -β 0 (x), β k (x ) + α 0 (x), α 0 (x ) β 0 (x), β k (x ) -β 0 (x ) (x) -α 0 (x) 2 + n α k (x ) -α 0 (x ) 2 + √ n β k+s (x) -β 0 (x) 2 + √ n β k (x ) -β 0 (x ) 2≤C L 3/2+q √ dn + log 3/4 (L)d 3/4 L 3+2q/3 n 11/12

is equal in distribution to a convex function of (E 1 , ..., E k ) after truncation (which can be seen by calculating second derivatives). The Lipschitz constant of this function is bounded by 4kn-k . = λ.If we define rescaled variables Ẽi = λE i then with the same probability they take values in [0, Kλ].

d , and due to our choice of η, this result holds for all k ≥ Cnη 2/3 ≥ CC 2/3 η n 2/3 L 1+2q/3 . Proof of lemma F.2. i) We begin by controlling the pre-activation norms. Considering a point x ∈ N n -3 n -1

We endow supports with the metric $d_s(S, S') = |S \,\triangle\, S'|$ and denote by $B_s(S, \delta) \subset \mathcal{P}([n])$ a ball defined with respect to this metric, where $\delta \in \{0\} \cup [n]$ and $\mathcal{P}(A)$ is the power set of a set $A$. For $\delta_\eta(y)$ and $B(y, \eta)$ defined in (F.8) and (F.4) respectively, it is clear that $B(y, \eta) \subseteq B_s(\mathrm{supp}(y > 0), \delta_\eta(y))$ and consequently

$$J_\eta(M) \subseteq \bigcup_{x \in M}\ \bigotimes_{\ell=1}^{L} B_s\big(\mathrm{supp}(\rho^\ell_0(x) > 0),\ \delta_\eta(\rho^\ell_0(x))\big).$$
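That the symmetric-difference distance $d_s$ is a genuine metric on supports (so that the ball-containment manipulations below are legitimate) can be verified exhaustively on a small ground set; an illustrative check:

```python
from itertools import combinations

def d_s(S, T):
    # symmetric-difference ("support") metric on subsets of [n]
    return len(S ^ T)

# all 16 subsets of {0, 1, 2, 3}
subsets = [frozenset(c) for r in range(5) for c in combinations(range(4), r)]
assert len(subsets) == 16
for S in subsets:
    for T in subsets:
        for U in subsets:
            # triangle inequality, checked exhaustively
            assert d_s(S, U) <= d_s(S, T) + d_s(T, U)
```

The triangle inequality is exactly what drives the containment $B_s(S, \delta) \subseteq B_s(S', \delta + d_s(S, S'))$ used in (F.28).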

For any S, S ∈ [n] and δ ∈ {0} ∪ [n], the triangle inequality impliesB s (S, δ) ⊆ B s (S , δ + d s (S, S )).For some p ∈ B E (g, r), we also haveδ η (p) = max y∈B E (p,η) d s (p, y) ≤ d s (p, g) + max y∈B E (p,η) d s (g, y) ≤d s (p, g) + max y∈B E (g,η+r) d s (g, y) =d s (p, g) + δ η+r (g) (F.27)where we used B E (p, η) ⊆ B E (g, η + r). It follows thatB s (supp(p > 0), δ η (p)) ⊆B s (supp(g > 0), δ η (p) + d s (p, g))⊆B s (supp(g > 0), δ η+r (g) + 2d s (p, g)).(F.28)From (D.62) and lemma G.10 we obtain that for d satisfying the requirements of lemma D.8,P   ∀x ∈ M, ∈ [L] : ∃x ∈ N n -3 n -1/2 0 ∩ N n -3 n -1/2 0 (x) s.t. ρ 0 (x) -ρ 0 (x) 2 ≤ Cn -5/2 ∩ d s (ρ 0 (x), ρ 0 (x)) ≤ d   ≥ 1 -6e -d/2 , (F.29)since under the assumptions of the lemma d s (ρ 0 (x), ρ 0 (x)) ≤ L =1

2 < 1 ≤ C e -cd ≤ Ce -c dfor some constants, assuming d > K log L for some K. Since on the complement of this event we have , lemma F.1 and a rescaling givesP δ 2η (ρ 0 (x)) ≥ C 0 nη 2/3 = P δ √ 2η α -1 (x) -1 2 (g) ≥ C 0 nη 2/3 ≤ P δ 2η (g) ≥ C 0 nη 2/3 + P √ 2 α -1 (x) 2 < 1 ≤ Ce -cd + C e -c d ≤ C e -c d . (F.31)for some constants. Taking a union bound over N n -3 n -1/2 0 and [L] we obtainP ∃x ∈ N n -3 n -1/2 0 , ∈ [L] s.t. δ 2η (ρ 0 (x)) ≥ C 0 nη 2/3 ≤ N n -3 n -1/2 0 LCe -cd ≤ C e -c d (F.32) under the same assumptions on d as in lemma D.8, and additionally assuming d ≥ K log L for some K.Since n, d, η satisfy the assumptions of lemma F.1, we have nη 2/3 ≥ C n 1/2 d 3/4 ≥ C d for some C and henceP ∃x ∈ N n -3 n -1/2 0 , ∈ [L] s.t. δ 2η (ρ 0 (x)) + 2d ≥ C 1 nη 2/3 ≤ Ce -cd (F.33)for some constants c, C, C 1 . Denoting the complement of above event by E N δ , we find that on E ρ ∩ E N δ , for every x we can find x ∈ N n -3 n -1/2 N n -3 n -1/2 0 (x) such that B s (supp(ρ 0 (x) > 0), δ η (ρ 0 (x))) ⊆ B s (supp(ρ 0 (x) > 0), δ 2η (ρ 0 (x)) + 2d) ⊆ B s (supp(ρ 0 (x) > 0), C 1 nη 2/3 ),

VolB s (supp(ρ 0 (x)), C 1 nη 2/3 )≤ C N n -3 n -1/2 0 CdLnη 2/3 ≤ C e C dLnη2/3 for appropriate constants, since nη 2/3 ≥ C d and d satisfies the assumptions of lemma D.8. Since after worsening constants we have P E ρ ∩ E N δ ≤ C e -cd , we obtain P S η > C e C dLnη 2/3 ≤ C e -cd (F.34)

3 n 2/3 -c n L ≥ 1 -C e -c n Lassuming n ≥ KL 9+2q d for some constant K. Taking a union bound over the probabilities of E J or E Kδ not holding, we finally obtainP ∀J ∈ J η (M), < ∈ [L], : C e -c n L -P E c J -P [E c δK ] ≥ 1 -C e -c n L -C e -c d -C e -c d ≥ 1 -C e -c dfor appropriate constants, where we used (F.35) to bound P [E c δK ], and in the last inequality we used n ≥ KLd for some K. Combining this with (F.25) and taking a union bound gives the desired result.

and P [E δK ] = P [E δ ] ≥ 1 -Ce -cd where the last bound uses the definition of E δ in (F.35). From the definition of d in lemma D.21, on the event E δK we have d ∞ ≤ δ s ≤ C 0 nη 2/3 .

For the first term in (B.42), we use Lemmas C.7, C.22 and C.24 to obtain that x → (ψ 1 •∠(x, x )) 2 is bounded by Cn 2 L 4 and C n 2 L 5 -Lipschitz for every x, and then applying (B.32) gives

and C lip ≥ 1 and worst-casing exponents on d and log L in the first line, and (B.68) in the second line. In particular, by the value of M in this regime, if n ≥ (3C e 14/δ ) 18 L 60+32q d 15 log 9 L

) .

) is also continuous. Continuity of the features as a function of the parameters and of γ N Following the proof of Lemma E.21, we see that the points of nondifferentiability of t → f γ N k (t) (x) are contained in the set of points of [0, 1] where there exists a layer at which at least one of the coordinates of α γ N k ( • ) (x) vanishes. Applying the chain rule at points of differentiability of the ReLU [ • ] + and assigning 0 otherwise, it follows that the derivative of

1⊥ W * P J Γ : +1 *

S j⊥ ) (kj ,kj )

Using the growth estimate for f , we have|g(t, x -u cos t)| ≤ C |x -u cos t| p |x cos t -u| sin p+2 t exp -(x -u cos t) 2 2c 2 sin 2 t ,where C > 0 depends only on c. We are going to bound this quantity under the assumption that |x| ≤ |u|/2, where we use the assumption |u| > 0. First, note that when π/4 ≤ t ≤ 3π/4, we have sin t ≥ 1/ √ 2, and we always have sin t ≤ 1 for 0 ≤ t ≤ π; so in this regime |g(t, x -u cos t)| ≤ C2 p/2+1 |x -u cos t| p |x cos t -u| exp -

1/2 , and the checks at and around (E.36) in the proof of Lemma E.15 show that we can apply Lemma E.30 to obtainE v 0 , v ν 2 v ν , vν 4 -n 6 E[σ(g 11 )σ(g 11 cos ν + g 21 sin ν)]

which gives the bound ν 2 h 2 (ν) ≥ -11π 2 /4 on [0, π/2]. Putting it all together, we have

Taking limits ν 0, we can assert this bound on [0, √ 2], and the bound is clearly an increasing function of ν, from which it follows where the last inequality follows from a numerical estimate of the constants. On the other hand, when √ 2 < ν ≤ π/2, we have from (E.85) thatIf we differentiate the degree four polynomial on the RHS of this bound and solve for critical points, we find a double critical point at ν = 0 and a critical point at ν = 12π 2 /83; a numerical estimate confirms that this critical point lies in the interior of[ √ 2, π/2].The second derivative of the RHS is -(4/π)ν + 83/(2π 3 )ν 2 , and plugging in ν = 12π 2 /83 gives a value of -48π/83 + 144π/83, which is positive; hence the RHS is maximized on the boundary, i.e.,

+s (x)dsSince the above holds for all choices of x, simultaneously, we conclude that Instead of solving (F.18), we obtain sufficient control by defining k * s.t.For any t ∈ [0, k * ], we obtain a sufficient condition for satisfying the above constraint using (F.18),

ACKNOWLEDGMENTS

This work was supported by the grants NSF 1733857, NSF 1838061, NSF 1740833, NSF 1740391, NSF NeuroNex Award DBI-1707398 (DG), the Gatsby Charitable Foundation (DG) and a Swartz fellowship (DG), and by a fellowship award (SB) through the National Defense Science and Engineering Graduate (NDSEG) Fellowship Program, sponsored by the Air Force Research Laboratory (AFRL), the Office of Naval Research (ONR) and the Army Research Office (ARO). The authors would like to thank Ethan Dyer, Guy Gur-Ari, Quynh Nguyen, Jeffrey Pennington, Sam Schoenholz, Daniel Soudry, and Tingran Wang for helpful discussions/feedback.

E.2 MAIN RESULTS

Lemma E.1. There exist absolute constants $c, C, C' > 0$ and absolute constants $K, K' > 0$ such that if $d \ge K$ and $n \ge K' d^4 \log^4 n$, then one has

Proof. Using the triangle inequality, we can write

Choose $n$ sufficiently large to satisfy the hypotheses of Lemmas E.6 and E.7; applying these lemmas to bound the first and second terms, we conclude the claimed result (after choosing $n$ larger than an absolute constant multiple of $d \log n$ so that the $n^{-cd}$ error dominates the $e^{-c'n}$ error).

Lemma E.2. One has

Proof. See (Cho & Saul, 2009).

Lemma E.3. There exist absolute constants $c, C, C', C'' > 0$ and absolute constants $K, K' > 0$ such that if $d \ge K$ and $n \ge K' d^4 \log^4 n$, then one has with probability at least

The constant $c$ is the same as the constant appearing in Lemma E.1.

Proof. Under our hypotheses, the second result in Lemma E.6 together with Lemma E.1 and the triangle inequality imply the claimed result (after worst-casing multiplicative constants).

Lemma E.4. There exist absolute constants $c, C, C' > 0$ and absolute constants $K, K' > 0$ such that if $d \ge K$ and $n \ge K' d^4 \log^4 n$, then one has

The constant $c$ is the same as the constant appearing in Lemma E.1.

Proof. Under our hypotheses, Lemma E.3 is applicable; we let $E$ denote the event corresponding to the bound in this lemma. By boundedness of $\cos^{-1}$, nonnegativity of $X_\nu$, and $\varphi \le \pi/2$ from Lemma E.2, we have

as claimed.

Lemma E.5. One has

1. $\varphi \in C^\infty(0, \pi)$, and $\dot\varphi$ and $\ddot\varphi$ extend to continuous functions on $[0, \pi]$;
2. $\varphi(0) = 0$ and $\varphi(\pi) = \pi/2$; $\dot\varphi(0) = 1$, $\ddot\varphi(0) = -2/(3\pi)$, and $\dddot\varphi(0) = -1/(3\pi^2)$; and $\dot\varphi(\pi) = \ddot\varphi(\pi) = 0$;
3. $\varphi$ is concave and strictly increasing on $[0, \pi]$ (strictly concave in the interior);

Thus, by a union bound and our previous results, we have

which is the desired measure bound.

Lemma E.17. We have for each fixed $\nu \in [0, \pi]$ that:

1. The coordinates of $\dot v_\nu$ have subgaussian moment growth

3. The event $\{\forall \nu \in [0, \pi],\ \|\dot v_\nu\|_2 \le 4\}$ has probability at least $1 - e^{-c'n}$.

Proof.
We have that the coordinates of vν are i.i.d., andby rotational invariance. By independence of g 1 and g 2 , we compute2n p/2 p p/2 , for each p ≥ 1; the last inequality follows from Lemma G.11. This shows that the coordinates of vν are independent subgaussian random variables with scale parameters at most C 2/n, so we have a tail bound (Vershynin, 2018, Theorem 3.1.1 ), also taking into account that E ( vν ) 2 i = 1/n. This shows that the event E = { vν 2 ≤ 2} has probability at least 1 -e -cn .For the third assertion, we use the triangle inequality to get vν 2 ≤ g 2 2 + g 2 2 , which has RHS independent of ν; then applying Gauss-Lipschitz concentration gives for t ≥ 0]. Putting t = 0.5 in this bound and applying a union bound, we conclude that there is an event of probability at least 1 -e -cn on which vν 2 ≤ 4 uniformly in ν.Lemma E.18. There exists an absolute constant C > 0 such that if n ≥ C, one haswhere c, C , C > 0 are absolute constants.Proof. For the upper bound, we apply the Schwarz inequality to get≤ 1, by rotational invariance and Lemma G.11. For the lower bound, we will truncate and linearize the product using logarithms. Let E = E 0.5,0 ; by Lemma E.16, as long as n ≥ 20 we have µwhere we apply the Schwarz inequality in the third line. Consequently, with probability at least 1 -C e -cd , we haveLemma E.34. For i = 1, . . . , n, let X i , Y i be random variables in L 4 , and let d > 0 and δ > 0. Suppose X i ≥ 0 for each i andProof. The proof is a minor elaboration on Lemma E.33. We apply the triangle inequality:where the second line holds with probability at least 1 -δ. Another application of the triangle inequality together with nonnegativity of the X i giveswhere the second line applies the Lyapunov inequality. By the Schwarz inequality and the Lyapunov inequality, we haveConsequently, with probability at least 1 -δ, we haveas claimed, where we use that C d/n ≤ 1 here.Lemma E.35. Let k ∈ N, and let X 1 , . . . 
, X k be integrable random variables satisfying Xevent E b = E ∩ E a , which has probability at least 1 -Ce -cn by a union bound, we have using Cauchy-Schwarz that for every νUsing the high probability deviations bound established in (E.64), it follows that if n is large enough then with probability at least 1 -Ce -cn -C n -2d+1/2 we haveAs long as d ≥ 1 2 , we have that this probability is at least 1 -Ce -cn -C n -d , and so the triangle inequality yields finally that with probability at least 1 -Lemma E.47. Consider the functionand the lower bound is positive ifProof. To see that the lower bound is positive under the stated condition, writethe quantity in parentheses is positive in a neighborhood of zero by continuity, and in fact one calculates for its unique zero ν 0 = 48π 2 /249, and one verifies numerically that 48π 2 /249 > 1.9 > π/2. We conclude that the bound is positive for 0 < ν < 1.9 by continuity.To establish the bound, we employ Taylor expansion of the numerator, which is a smooth function on (0, π) with continuous derivatives of all orders on [0, π], in a neighborhood of zero. In our development in the proof of Lemma E.5, we showed that the analytic function -g(ν) = -(2π 2 /3)ν 3 + O(ν 4 ) near zero, so Taylor's theorem with Lagrange remainder impliesand so it suffices to get suitable bounds on the fourth derivative of g. We will develop the bounds rather tediously. Start by distributing in g to write.Using the Leibniz rule, we have for the fourth derivative2 (ν) + 12g(3)1 (ν) + 8g(3) 2 (ν) + 36g(2)0 (ν) + 4g(3) 1 (ν) + 12g(2) 2 (ν) + 24g(1) 3 (ν) .To calculate these derivatives, we just need to differentiate sin, cos, and their third powers. 
Write $c(\nu) = \cos^3(\nu)$ and $s(\nu) = \sin^3(\nu)$; using the elementary calculations, one can calculate the results

$g_2(\nu) = 3\pi\cos\nu + \sin\nu$; $g_0(\nu) = (2\pi^2 - 60)\sin\nu + 50\pi\cos\nu + 81\pi\cos^3\nu - 81\sin^3\nu$; and $g_1(\nu) = (7 - 2\pi^2)\sin\nu + 2\pi\cos\nu - 27\cos^2\nu\sin\nu$; and $g^{(2)}_2(\nu) = -3\pi\cos\nu - \sin\nu$; and finally $g^{(1)}_3(\nu) = \sin\nu$.

Plugging back into (E.75) and canceling, we get

Since $\nu > 0$, we can leverage lower bounds on each $h_i$ term. We have trivially $|h_3| \le 1$, so that $|\nu^3 h_3(\nu)| \le \pi^3/8$. We will study $\nu h_1(\nu) + h_0(\nu)$ together to get a better bound. We have

(E.76)

using $\nu \le \pi/2$ and $\cos \ge 0$ on this domain. We will show that the RHS of the final inequality, denoted $q$, is a decreasing function of $\nu$, and is therefore lower bounded by its value at $\nu = \pi/2$ on our interval of interest. We calculate

Reordering terms, we can write

which shows that $C_1, C_2, C_3, C_4 > 0$ and both of the linear prefactors are decreasing functions of $\nu$. We have on all of $(0, \pi/2)$, by concavity of $\sin$,

Published as a conference paper at ICLR 2021

E.4 DEFERRED PROOFS

Proof of Lemma E.5. The function cos -1 is C ∞ on (-1, 1), and because f (ν) := cos ϕ(ν) is smooth and satisfies f (ν) = (π -1 ν -1) sin ν < 0 if ν < π with f (0) = 1 and f (π) = 0, we see that ϕ is C ∞ on (0, π) by the chain rule. This also shows ϕ(0) = cos -1 (1) = 0 and ϕ(π) = cos -1 (0) = π/2. Direct calculation givesCalculating endpoint limits using these expressions will suffice to show the derivatives are continuous on [0, π] and give the claimed values there. We haveby L'Hôpital's rule, whereas a direct evaluation givesContinuity of the square root function gives the claimed results for φ. Again by direct calculation, we find lim ν 0 φ(ν) 2 = 0 π 3 = 0. Since φ2 is meromorphic in a neighborhood of 0 with, as we have shown, a removable singularity at 0, it is actually analytic, and we can calculate further derivatives at 0 by expanding it locally at 0. We use the expansions sin ν = ν -ν 3 /6 + O(ν 5 ) and cos ν = 1 -ν 2 /2 + ν 4 /24 + O(ν 6 ) near 0 to calculate.By the geometric series, we then obtain). Taking the square root of this expression and applying the binomial series, we thus haveThus for some fixed t ∈ [0, T η ], if we denotewe can apply the second result of lemma D.21, choosingfor some appropriately chosen constants K, K , K .Assuming n ≥ dL 2 and n 1/12 √ L 3+2q/3 d 1/4 ≥ K for some constant K to simplify the result, we obtain+ C e -c n L . The constants K, K are chosen such that this result can be uniformized over the set of possible supports J η (M) and [L] . Since on the event E J defined in (F.37) the size of this set is bounded, we havefor some constant c, C, C , assuming n ≥ K L 9+2q d for some constant K . Taking a union bound over the complements of E J and E δK using (F.37) and (F.35), we have≤ K C 2/3 η n 5/12 log 3/4 (L)L 3+2q/3 d 3/4 ≥1 -C e -cd0L 2+2q/3 n 2/3 -C e -c d ≥1 -C e -c d for appropriate constants, assuming log(L)L 2+2q/3 n 2/3 > K for some constant K.

G AUXILIARY RESULTS

Lemma G.1 (Hoeffding's Inequality (Vershynin, 2018, Theorem 2.2.6)). Let $X_1, \dots, X_N$ be independent random variables. Assume that $X_i \in [m_i, M_i]$ for every $i$. Then for any $t > 0$, we have

$$\mathbb{P}\Big[\Big|\sum_{i=1}^{N} (X_i - \mathbb{E}[X_i])\Big| \ge t\Big] \le 2 \exp\Big(-\frac{2t^2}{\sum_{i=1}^{N} (M_i - m_i)^2}\Big).$$

Lemma G.2 (Bernstein's Inequality (Vershynin, 2018, Theorem 2.8.1)). Let $X_1, \dots, X_N$ be independent mean-zero subexponential random variables. Then, for every $t \ge 0$, one has

$$\mathbb{P}\Big[\Big|\sum_{i=1}^{N} X_i\Big| \ge t\Big] \le 2 \exp\Big(-c \min\Big\{\frac{t^2}{\sum_{i=1}^{N} \|X_i\|_{\psi_1}^2},\ \frac{t}{\max_i \|X_i\|_{\psi_1}}\Big\}\Big),$$

where $c > 0$ is an absolute constant, and $\|\cdot\|_{\psi_1}$ denotes the subexponential norm.
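Hoeffding's inequality is simple to see in action. The simulation below uses hypothetical parameters (centered uniform variables, seeded for reproducibility) and compares the empirical tail of a sum against the bound $2\exp(-2t^2/\sum_i (M_i - m_i)^2)$; it is an illustration, not part of the argument.

```python
import math
import random

def hoeffding_bound(t, ranges):
    # P[|sum_i (X_i - E X_i)| >= t] <= 2 exp(-2 t^2 / sum_i (M_i - m_i)^2)
    return 2.0 * math.exp(-2.0 * t * t / sum(r * r for r in ranges))

random.seed(0)  # seeded for reproducibility
n, trials, t = 100, 2000, 20.0
exceed = sum(
    abs(sum(random.random() - 0.5 for _ in range(n))) >= t
    for _ in range(trials)
)
bound = hoeffding_bound(t, [1.0] * n)  # each X_i uniform on [-1/2, 1/2], range 1
assert exceed / trials <= bound        # empirical tail sits below the bound
```

Here the bound evaluates to $2e^{-8} \approx 6.7 \times 10^{-4}$, while a deviation of $20$ for a sum with standard deviation $\sqrt{100/12} \approx 2.9$ essentially never occurs.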

