ON THE RELATIVE ERROR OF RANDOM FOURIER FEATURES FOR PRESERVING KERNEL DISTANCE

Abstract

The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, the ability to preserve the kernel distance with relative error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing that, as long as the shift-invariant kernel is analytic, RFF with $\mathrm{poly}(\varepsilon^{-1} \log n)$ dimensions achieves ε-relative error for pairwise kernel distance of n points, and the dimension bound is improved to $\mathrm{poly}(\varepsilon^{-1} \log k)$ for the specific application of kernel k-means. Finally, going beyond RFF, we make the first step towards data-oblivious dimension reduction for general shift-invariant kernels, and we obtain a similar $\mathrm{poly}(\varepsilon^{-1} \log n)$ dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets, and they demonstrate superior performance compared with other popular methods including random-projection and Nyström methods.

1. INTRODUCTION

We study the ability of the random Fourier features (RFF) method (Rahimi & Recht, 2007) to preserve the kernel distance with relative error. The kernel method (Schölkopf & Smola, 2002) is a systematic way to map the input data into an (indefinitely) high-dimensional feature space to introduce richer structures, such as non-linearity. In particular, for a set of n data points P, a kernel function K : P × P → R implicitly defines a feature mapping φ : P → H to a feature space H which is a Hilbert space, such that ∀x, y, K(x, y) = ⟨φ(x), φ(y)⟩. Kernel methods have been successfully applied to classical machine learning (Boser et al., 1992; Schölkopf et al., 1998; Girolami, 2002), and it has been recently established that in a certain sense the behavior of neural networks may be modeled as a kernel (Jacot et al., 2018).

Despite their superior power and wide applicability, scalability has been an outstanding issue in applying kernel methods. Specifically, the representation of data points in the feature space is only implicit, and solving for the explicit representation, which is crucially required in many algorithms, takes at least $\Omega(n^2)$ time in the worst case. While for many problems such as kernel SVM it is possible to apply the so-called "kernel trick" to rewrite the objective in terms of K(x, y), the explicit representation is still often preferred, since it is compatible with a larger range of solvers/algorithms, which allows better efficiency.

In a seminal work (Rahimi & Recht, 2007), Rahimi and Recht addressed this issue by introducing the method of random Fourier features (see Section 2 for a detailed description), to compute an explicit low-dimensional mapping $\varphi' : P \to \mathbb{R}^D$ (for D ≪ n) such that $\langle \varphi'(x), \varphi'(y) \rangle \approx \langle \varphi(x), \varphi(y) \rangle = K(x, y)$, for shift-invariant kernels (i.e., there exists K : P → R such that K(x, y) = K(x − y)), which include the widely used Gaussian kernels, Cauchy kernels and Laplacian kernels. Towards understanding this fundamental method of RFF, a long line of research has focused on analyzing the tradeoff between the target dimension D and the accuracy of approximating K under certain error measures. This includes additive error $\max_{x,y} |\langle \varphi(x), \varphi(y) \rangle - \langle \varphi'(x), \varphi'(y) \rangle|$ (

Lower Bounds. Surprisingly, we show that this is not the case.
In particular, we show that for a wide range of kernels, including the well-known Laplacian kernels, RFF requires unbounded target dimension to preserve the kernel distance with constant multiplicative error. We state the result for a Laplacian kernel below; the full statement of the general conditions on kernels can be found in Theorem 4.1. In fact, what we show is a quantitatively stronger result: if the input is (∆, ρ)-bounded, then preserving any constant multiplicative error requires Ω(poly(∆/ρ)) target dimension. Here, a point $x \in \mathbb{R}^d$ is (∆, ρ)-bounded if $\|x\|_\infty \le \Delta$ and $\min_{i : x_i \neq 0} |x_i| \ge \rho$, i.e., the magnitude is (upper) bounded by ∆ and the resolution is (lower) bounded by ρ.

Theorem 1.1 (Lower bound; see Remark 4.1). For every ∆ ≥ ρ > 0 and some feature mapping $\varphi : \mathbb{R}^d \to \mathcal{H}$ of a Laplacian kernel $K(x, y) = \exp(-\|x - y\|_1)$, if for every $x, y \in \mathbb{R}^d$ that are (∆, ρ)-bounded, the RFF mapping π for K with target dimension D satisfies $\mathrm{dist}_\pi(x, y) \in (1 \pm \varepsilon) \cdot \mathrm{dist}_\varphi(x, y)$ with constant probability, then $D \ge \Omega\!\left(\frac{1}{\varepsilon^2} \cdot \frac{\Delta}{\rho}\right)$. This holds even when d = 1.

Upper Bounds. Complementing the lower bound, we show that RFF can indeed preserve the kernel distance within 1 ± ε error using $\mathrm{poly}(\varepsilon^{-1} \log n)$ target dimensions with high probability, as long as the kernel function is shift-invariant and analytic, which includes Gaussian kernels and Cauchy kernels. Our target dimension nearly matches (up to the degree of the polynomial in the parameters) that achievable by the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984), which is shown to be tight (Larsen & Nelson, 2017). This upper bound also greatly generalizes the result of Chen & Phillips (2017), which only works for Gaussian kernels (see Section G for a detailed comparison).

Theorem 1.2 (Upper bound). Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel function which is shift-invariant and analytic at the origin, with feature mapping $\varphi : \mathbb{R}^d \to \mathcal{H}$ for some feature space $\mathcal{H}$.
For every $0 < \delta \le \varepsilon \le 2^{-16}$ and every $d, D \in \mathbb{N}$ with $D \ge \max\{\Theta(\varepsilon^{-1} \log^3(1/\delta)), \Theta(\varepsilon^{-2} \log(1/\delta))\}$, if $\pi : \mathbb{R}^d \to \mathbb{R}^D$ is an RFF mapping for K with target dimension D, then for every $x, y \in \mathbb{R}^d$, $\Pr[|\mathrm{dist}_\pi(x, y) - \mathrm{dist}_\varphi(x, y)| \le \varepsilon \cdot \mathrm{dist}_\varphi(x, y)] \ge 1 - \delta$.

The technical core of our analysis is a moment bound for RFF, which is derived by analysis techniques such as Taylor expansion and Cauchy's integral formula for multivariate functions. The moment bound is slightly weaker than the moment bound of Gaussian variables, and this is the primary reason that we obtain a bound weaker than that of the Johnson-Lindenstrauss transform. Finally, several additional steps are required to fit this moment bound into Bernstein's inequality, which implies the bound in Theorem 1.2.

Improved Dimension Bound for Kernel k-Means. We show that if we focus on the specific application of kernel k-means, then it suffices to set the target dimension $D = \mathrm{poly}(\varepsilon^{-1} \log k)$, instead of $D = \mathrm{poly}(\varepsilon^{-1} \log n)$, to preserve the kernel k-means clustering cost for every k-partition. This follows from the probabilistic guarantee of RFF in Theorem 1.2 plus a generalization of the dimension-reduction result proved in a recent paper (Makarychev et al., 2019). Here, given a data set $P \subset \mathbb{R}^d$ and a kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, denoting the feature mapping as $\varphi : \mathbb{R}^d \to \mathcal{H}$, the kernel k-means problem asks to find a k-partition $\mathcal{C} := \{C_1, \ldots, C_k\}$ of P, such that $\mathrm{cost}^\varphi(P, \mathcal{C}) = \sum_{i=1}^{k} \min_{c_i \in \mathcal{H}} \sum_{x \in C_i} \|\varphi(x) - c_i\|_2^2$ is minimized.

Theorem 1.3 (Dimension reduction for clustering; see Theorem 3.1). For the kernel k-means problem whose kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is shift-invariant and analytic at the origin, for every data set $P \subset \mathbb{R}^d$, the RFF mapping $\pi : \mathbb{R}^d \to \mathbb{R}^D$ with target dimension $D \ge O\!\left(\frac{1}{\varepsilon^2}\left(\log^3 \frac{k}{\delta} + \log^3 \frac{1}{\varepsilon}\right)\right)$, with probability at least 1 − δ, preserves the clustering cost within 1 ± ε error for every k-partition simultaneously.
Applying RFF to speed up kernel k-means has also been considered in Chitta et al. (2012), but their error bound is much weaker than ours (and theirs is not a generic dimension-reduction bound). Also, similar dimension-reduction bounds (i.e., independent of n) for kernel k-means were obtained using Nyström methods (Musco & Musco, 2017; Wang et al., 2019), but their bound is poly(k), which is worse than our poly(log k); furthermore, our RFF-based approach is unique in that it is data-oblivious, which enables great applicability in other relevant computational settings such as streaming and distributed computing.

Going beyond RFF. Finally, even though we have proved that RFF cannot preserve the kernel distance for every shift-invariant kernel, this does not rule out the existence of other efficient data-oblivious dimension-reduction methods for those kernels, particularly for the Laplacian kernel, which is the primary example in our lower bound. For instance, in the same paper where RFF was proposed, Rahimi and Recht (Rahimi & Recht, 2007) also considered an alternative embedding called "binning features" that can work for Laplacian kernels. Unfortunately, to achieve a relative error of ε, it requires a dimension that depends linearly on the magnitude/aspect-ratio of the dataset, which may be exponential in the input size. Follow-up works, such as (Backurs et al., 2019), also suffer from similar issues.

We make the first successful attempt in this direction, and we show that Laplacian kernels do admit an efficient data-oblivious dimension reduction. Here, we use a setting similar to that of our lower bound (Theorem 1.1), where we focus on the (∆, ρ)-bounded case.

Theorem 1.4 (Oblivious dimension reduction for Laplacian kernels; see Theorem F.1). Let K be a Laplacian kernel, and denote its feature mapping as $\varphi : \mathbb{R}^d \to \mathcal{H}$.
For every $0 < \delta \le \varepsilon \le 2^{-16}$, every $D \ge \max\{\Theta(\varepsilon^{-1} \log^3(1/\delta)), \Theta(\varepsilon^{-2} \log(1/\delta))\}$ and every ∆ ≥ ρ > 0, there is a mapping $\pi : \mathbb{R}^d \to \mathbb{R}^D$, such that for every $x, y \in \mathbb{R}^d$ that are (∆, ρ)-bounded, it holds that $\Pr[|\mathrm{dist}_\pi(x, y) - \mathrm{dist}_\varphi(x, y)| \le \varepsilon \cdot \mathrm{dist}_\varphi(x, y)] \ge 1 - \delta$. The time for evaluating π is $dD \cdot \mathrm{poly}(\log \frac{\Delta}{\rho}, \log \delta^{-1})$.

Our target dimension only depends on $\log \frac{\Delta}{\rho}$, which may be interpreted as the precision of the input. Hence, as an immediate corollary, for any n-point dataset with precision 1/poly(n), we have an embedding with target dimension $D = \mathrm{poly}(\varepsilon^{-1} \log n)$, where the success probability is 1 − 1/poly(n) and the overall running time of embedding the n points is $O(n \cdot \mathrm{poly}(d \varepsilon^{-1} \log n))$.

Our proof relies on the fact that every $\ell_1$ metric space can be embedded into a squared $\ell_2$ metric space isometrically. We explicitly implement an approximate version of this embedding (Kahane, 1981), and eventually reduce our problem of Laplacian kernels to Gaussian kernels. After this reduction, we use the RFF for Gaussian kernels to obtain the final mapping. However, since the embedding into squared $\ell_2$ has very high dimension, to implement this whole idea efficiently we need to utilize the special structure of the embedding, combined with an application of space-bounded pseudo-random generators (PRGs) (Nisan, 1992).

Even though our algorithm utilizes a special property of Laplacian kernels and still partially uses the RFF for Gaussian kernels, it is of conceptual importance. It opens up the direction of exploring general methods for Johnson-Lindenstrauss style dimension reduction for shift-invariant kernels. Furthermore, the lower bound suggests that a Johnson-Lindenstrauss style dimension reduction for general shift-invariant kernels has to be non-differentiable, which is a fundamental difference from RFF.
This requirement of being "not analytical" seems very counter-intuitive, but our construction of the mapping for Laplacian kernels indeed provides valuable insights into how the non-analytical mapping behaves.

Experiments and Comparison to Other Methods. Apart from RFF, the Nyström and the random-projection methods are alternative popular methods for kernel dimension reduction. In Section 6, we conduct experiments to compare their empirical dimension-error tradeoffs with that of our methods on a simulated dataset. Since we focus on the error, we use the "ideal" implementation of both methods that achieves the best accuracy, so the comparison can only favor the two baselines: for Nyström, we use SVD on the kernel matrix, since Nyström methods can be viewed as fast and approximate low-rank approximations to the kernel matrix; for random projection, we apply the Johnson-Lindenstrauss transform on the explicit representations of points in the feature space. We run two experiments to compare each of RFF (on a Gaussian kernel) and our new algorithm in Theorem 1.4 (on a Laplacian kernel) with the two baselines respectively. Our experiments indicate that the Nyström method is indeed incapable of preserving the kernel distance in relative error, and more interestingly, our methods perform the best among the three, even better than the Johnson-Lindenstrauss transform, which is optimal in the worst case.

1.2. RELATED WORK

Variants of the vanilla RFF, particularly those that use information in the input data set and/or sample random features non-uniformly, have also been considered, including leverage score sampling random Fourier features (LSS-RFF) (Rudi et al., 2018; Liu et al., 2020; Erdélyi et al., 2020; Li et al., 2021), weighted random features (Rahimi & Recht, 2008; Avron et al., 2016; Chang et al., 2017; Dao et al., 2017), and kernel alignment (Shahrampour et al., 2018; Zhen et al., 2020). The RFF-based methods usually work for shift-invariant kernels only. For general kernels, techniques based on low-rank approximation of the kernel matrix were developed, notably the Nyström method (Williams & Seeger, 2000; Gittens & Mahoney, 2016; Musco & Musco, 2017; Oglic & Gärtner, 2017; Wang et al., 2019) and incomplete Cholesky factorization (Fine & Scheinberg, 2001; Bach & Jordan, 2002; Chen et al., 2021; Jia et al., 2021). Moreover, specific sketching techniques are known for polynomial kernels (Avron et al., 2014; Woodruff & Zandieh, 2020; Ahle et al., 2020; Song et al., 2021), a basic type of kernel that is not shift-invariant.

2. PRELIMINARIES

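Random Fourier Features. RFF was first introduced by Rahimi and Recht (Rahimi & Recht, 2007). It is based on the fact that, for a shift-invariant kernel $K : \mathbb{R}^d \to \mathbb{R}$ such that K(0) = 1 (this can be assumed w.l.o.g. by normalization), the function $p : \mathbb{R}^d \to \mathbb{R}$ given by
\[ p(\omega) = \frac{1}{2\pi} \int_{\mathbb{R}^d} K(x) e^{-i \langle \omega, x \rangle} \, dx, \]
which is the Fourier transform of K(·), is a probability distribution (guaranteed by Bochner's theorem (Bochner, 1933; Rudin, 1991)). Then, the RFF mapping is defined as
\[ \pi(x) := \sqrt{\frac{1}{D}} \begin{pmatrix} \sin\langle \omega_1, x \rangle \\ \cos\langle \omega_1, x \rangle \\ \vdots \\ \sin\langle \omega_D, x \rangle \\ \cos\langle \omega_D, x \rangle \end{pmatrix}, \]
where $\omega_1, \omega_2, \ldots, \omega_D \in \mathbb{R}^d$ are i.i.d. samples from the distribution with density p.

Theorem 2.1 (Rahimi & Recht 2007). $\mathbb{E}[\langle \pi(x), \pi(y) \rangle] = \frac{1}{D} \sum_{i=1}^{D} \mathbb{E}[\cos\langle \omega_i, x - y \rangle] = K(x - y)$.

Fact 2.1. Let ω be a random variable with distribution p over $\mathbb{R}^d$. Then for all $t \in \mathbb{R}$,
\[ \mathbb{E}[\cos(t \langle \omega, x - y \rangle)] = \Re \int_{\mathbb{R}^d} p(\omega) e^{i \langle \omega, t(x - y) \rangle} \, d\omega = K(t(x - y)), \]
and
\[ \mathrm{Var}(\cos\langle \omega, x - y \rangle) = \frac{1 + K(2(x - y)) - 2K(x - y)^2}{2}. \]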
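The definition of π and Theorem 2.1 can be illustrated numerically. The following is a minimal sketch, assuming the Gaussian kernel K(x) = exp(−‖x‖²/2), for which the sampling density p is the standard normal density; the function names `rff_map` and `gaussian_kernel` and all parameter values are illustrative choices, not taken from the paper. The sine and cosine coordinates are stored in two blocks rather than interleaved, which leaves all inner products unchanged.

```python
import numpy as np

def rff_map(X, omegas):
    """Map the rows of X (shape (n, d)) to RFF features of shape (n, 2D).

    omegas has shape (D, d); each row is one sampled frequency vector.
    """
    D = omegas.shape[0]
    proj = X @ omegas.T  # entry (a, i) is <omega_i, x_a>
    # sin block followed by cos block, scaled by sqrt(1/D)
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

def gaussian_kernel(x, y):
    """K(x - y) = exp(-||x - y||_2^2 / 2)."""
    return np.exp(-0.5 * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
d, D = 5, 20000
# Assumption: for this Gaussian kernel, the frequencies are standard normal.
omegas = rng.standard_normal((D, d))
x = 0.5 * rng.standard_normal(d)
y = 0.5 * rng.standard_normal(d)

px, py = rff_map(np.vstack([x, y]), omegas)
approx = float(px @ py)               # <pi(x), pi(y)>
exact = float(gaussian_kernel(x, y))  # K(x - y)
print(approx, exact)
```

Because sin² + cos² = 1 in every coordinate pair, each feature vector has unit norm, which matches the normalization K(0) = 1.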

3. UPPER BOUNDS

We present two results in this section. We start with Section 3.1 to show that RFF preserves the relative error of kernel distance using $\mathrm{poly}(\varepsilon^{-1} \log n)$ target dimensions with high probability, when the kernel function is shift-invariant and analytic at the origin. Then in Section 3.2, combining this bound with a generalized analysis from a recent paper (Makarychev et al., 2019), we show that RFF also preserves the clustering cost for kernel k-clustering problems with $\ell_p$-objective, with target dimension only $\mathrm{poly}(\varepsilon^{-1} \log k)$, which is independent of n.

3.1. PROOF OF THEOREM 1.2: THE RELATIVE ERROR FOR PRESERVING KERNEL DISTANCE

Since K is shift-invariant, we interpret K as a function on $\mathbb{R}^d$ instead of $\mathbb{R}^d \times \mathbb{R}^d$, such that K(x, y) = K(x − y). As in Section 2, let $p : \mathbb{R}^d \to \mathbb{R}$ be the Fourier transform of K, and suppose that in the RFF mapping π, the random variables $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ are sampled i.i.d. from the distribution with density p. When we say K is analytic at the origin, we mean that there exists some constant r such that K is analytic in $\{x \in \mathbb{R}^d : \|x\|_1 < r\}$. We pick $r_K$ to be the maximum such constant r. Also notice that in $D \ge \max\{\Theta(\varepsilon^{-1} \log^3(1/\delta)), \Theta(\varepsilon^{-2} \log(1/\delta))\}$, there are constants depending on K hidden inside the Θ, i.e., $R_K$ as in Lemma 3.2.

Fact 3.1. The following holds.
• $\mathrm{dist}_\pi(x, y) = \sqrt{2 - \frac{2}{D} \sum_{i=1}^{D} \cos\langle \omega_i, x - y \rangle}$, and $\mathrm{dist}_\varphi(x, y) = \sqrt{2 - 2K(x - y)}$.
• $\Pr[|\mathrm{dist}_\pi(x, y) - \mathrm{dist}_\varphi(x, y)| \le \varepsilon \cdot \mathrm{dist}_\varphi(x, y)] \ge \Pr[|\mathrm{dist}_\pi(x, y)^2 - \mathrm{dist}_\varphi(x, y)^2| \le \varepsilon \cdot \mathrm{dist}_\varphi(x, y)^2]$.

Define $X_i(x) := \cos\langle \omega_i, x \rangle - K(x)$. As a crucial step, we next analyze the moments of the random variables $X_i(x - y)$. This bound will be plugged into Bernstein's inequality to conclude the proof.

Lemma 3.1. If for some r > 0, K is analytic in $\{x \in \mathbb{R}^d : \|x\|_1 < r\}$, then for every even k ≥ 1 and every x such that $\|x\|_1 < r$, we have $\mathbb{E}[|X_i(x)|^k] \le \left(\frac{4k \|x\|_1}{r}\right)^{2k}$.

Proof. The proof can be found in Section A.

Lemma 3.2. For a kernel K which is shift-invariant and analytic at the origin, there exist $c_K, R_K > 0$ such that for all $\|x\|_1 \le R_K$, $\frac{1 - K(x)}{\|x\|_1^2} \ge \frac{c_K}{2}$.

Proof. The proof can be found in Section B.

Proof sketch of Theorem 1.2. We present a proof sketch for Theorem 1.2; the full proof can be found in Section C. We focus on the case $\|x - y\|_1 \le R_K$ (the other case can be found in the full proof). Then by Lemma 3.2, we have $2 - 2K(x - y) \ge c \|x - y\|_1^2$, where $c := c_K$. Hence
\[ \Pr\left[\left|\frac{2}{D} \sum_{i=1}^{D} X_i(x - y)\right| \le \varepsilon \cdot (2 - 2K(x - y))\right] \ge \Pr\left[\left|\frac{2}{D} \sum_{i=1}^{D} X_i(x - y)\right| \le c \varepsilon \cdot \|x - y\|_1^2\right]. \]
We take $r = r_K$ for simplicity of exposition. Assume $\delta \le \min\{\varepsilon, 2^{-16}\}$, and let $k = \log(2D^2/\delta)$ and $t = 64k^2/r^2$, where k is even. By Markov's inequality and Lemma 3.1:
\[ \Pr[|X_i(x - y)| \ge t \|x - y\|_1^2] = \Pr\left[|X_i(x - y)|^k \ge t^k \|x - y\|_1^{2k}\right] \le \frac{(4k)^{2k}}{t^k r^{2k}} = 4^{-k} \le \frac{\delta}{2D^2}. \]
For simplicity, denote $X_i(x - y)$ by $X_i$ and $\|x - y\|_1^2$ by ℓ, and define $X'_i = \mathbb{1}_{[|X_i| \ge t\ell]} \, t\ell \cdot \mathrm{sgn}(X_i) + \mathbb{1}_{[|X_i| < t\ell]} \, X_i$; note that $\mathbb{E}[X_i] = 0$. By some further calculations and plugging in the parameters t, δ, D, we can eventually obtain $\mathbb{E}[|X'_i|] \le \delta \ell$. Denote by $\sigma'^2$ the variance of $X'_i$; then again by Lemma 3.1 we immediately have $\sigma' \le 64\ell/r^2$. The theorem follows by a straightforward application of Bernstein's inequality.

3.2. DIMENSION REDUCTION FOR KERNEL CLUSTERING

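We present the formal statement for Theorem 1.3. It is stated for the more general kernel k-clustering problem with $\ell_2^p$-objective, whose cost for a k-partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of P is $\mathrm{cost}_p^\varphi(P, \mathcal{C}) := \sum_{i=1}^{k} \min_{c_i \in \mathcal{H}} \sum_{x \in C_i} \|\varphi(x) - c_i\|_2^p$.

Theorem 3.1 (Generalization of Makarychev et al. 2019, Theorem 3.6). For the kernel k-clustering problem with $\ell_2^p$-objective whose kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is shift-invariant and analytic at the origin, for every data set $P \subset \mathbb{R}^d$, the RFF mapping $\pi : \mathbb{R}^d \to \mathbb{R}^D$ with target dimension $D = \Omega\!\left(\left(p^2 \log^3 \frac{k}{\alpha} + p^5 \log^3 \frac{1}{\varepsilon} + p^8\right) / \varepsilon^2\right)$ satisfies
\[ \Pr\left[\forall \text{ k-partition } \mathcal{C} \text{ of } P : \mathrm{cost}_p^\pi(P, \mathcal{C}) \in (1 \pm \varepsilon) \cdot \mathrm{cost}_p^\varphi(P, \mathcal{C})\right] \ge 1 - \delta. \]

Proof. The proof can be found in Section D.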
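For the k-means case p = 2, the optimal center of each cluster is the mean of its feature vectors, so the clustering cost can be evaluated from a Gram matrix alone: a cluster C contributes $\sum_{x \in C} K(x, x) - \frac{1}{|C|} \sum_{x, y \in C} K(x, y)$. The following minimal sketch uses this identity to compare the exact cost with the cost after an RFF mapping for one fixed partition. It assumes a Gaussian kernel exp(−‖x − y‖²/2) and synthetic data; the helper name `clustering_cost` and all parameter values are illustrative assumptions, and a single random partition is of course only a spot check, not the "for every k-partition" guarantee of the theorem.

```python
import numpy as np

def clustering_cost(G, labels):
    """k-means cost in feature space, computed from a Gram matrix G.

    Each cluster C contributes trace(G_C) - sum(G_C) / |C|, which equals the
    sum of squared distances of its points to the cluster mean.
    """
    cost = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        Gc = G[np.ix_(idx, idx)]
        cost += np.trace(Gc) - Gc.sum() / idx.size
    return float(cost)

rng = np.random.default_rng(1)
n, d, k, D = 60, 5, 3, 4000          # illustrative sizes
X = 0.5 * rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)  # an arbitrary fixed k-partition

# Exact Gram matrix of the Gaussian kernel exp(-||x - y||^2 / 2).
sq = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
G_exact = np.exp(-0.5 * sq_dists)

# Gram matrix of the RFF features (standard normal frequencies).
omegas = rng.standard_normal((D, d))
proj = X @ omegas.T
Z = np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)
G_rff = Z @ Z.T

cost_exact = clustering_cost(G_exact, labels)
cost_rff = clustering_cost(G_rff, labels)
print(cost_exact, cost_rff)
```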

4. LOWER BOUNDS

Theorem 4.1. Consider ∆ ≥ ρ > 0, and a shift-invariant kernel function $K : \mathbb{R}^d \to \mathbb{R}$, denoting its feature mapping by $\varphi : \mathbb{R}^d \to \mathcal{H}$. Then there exist $x, y \in \mathbb{R}^d$ that are (∆, ρ)-bounded, such that for every 0 < ε < 1, the RFF mapping π for K with target dimension D satisfies
\[ \Pr[|\mathrm{dist}_\varphi(x, y) - \mathrm{dist}_\pi(x, y)| \ge \varepsilon \cdot \mathrm{dist}_\varphi(x, y)] \ge \frac{2}{\sqrt{2\pi}} \int_{6\varepsilon \sqrt{D / s_K^{(\Delta, \rho)}}}^{\infty} e^{-s^2/2} \, ds - O\!\left(D^{-\frac{1}{2}}\right) \quad (2) \]
where $s_K^{(\Delta, \rho)} := \sup_{(\Delta, \rho)\text{-bounded } x \in \mathbb{R}^d} s_K(x)$, and $s_K(x) := \frac{1 + K(2x) - 2K(x)^2}{2(1 - K(x))^2}$.

Proof. The proof can be found in Section E.

Note that the right hand side of (2) is always less than 1, since the first term $\frac{2}{\sqrt{2\pi}} \int_{6\varepsilon \sqrt{D / s_K^{(\Delta, \rho)}}}^{\infty} e^{-s^2/2} \, ds$ achieves its maximum at $\frac{D \varepsilon^2}{s_K^{(\Delta, \rho)}} = 0$, and this maximum is 1. On the other hand, we need the right hand side of (2) to be > 0 in order to obtain a useful lower bound, and a typical setup to achieve this is when $D = \Theta\!\left(s_K^{(\Delta, \rho)}\right)$.

Intuition of $s_K$. Observe that $s_K(x)$ measures the ratio between the variance of RFF and the (squared) expectation evaluated at x. The intuition for considering this comes from the central limit theorem. Indeed, when the number of samples/target dimension is sufficiently large, the error/difference behaves like a Gaussian distribution, where with constant probability the error is $\approx \sqrt{\mathrm{Var}}$. Hence, $s_K$ measures the "typical" relative error when the target dimension is sufficiently large, and an upper bound on $s_K^{(\Delta, \rho)}$ is naturally a necessary condition for bounded relative error. The following gives a simple (sufficient) condition for kernels that do not have a bounded $s_K(x)$.

Remark 4.1 (Simple sufficient conditions for lower bounds). Assume the input dimension is 1, so $K : \mathbb{R} \to \mathbb{R}$, and assume ∆ = 1, ρ < 1. Then the (∆, ρ)-bounded property simply requires ρ ≤ |x| ≤ 1. We claim that, if K's first derivative at 0 is non-zero, i.e., K′(0) ≠ 0, then RFF cannot preserve relative error for such K. To see this, we use Taylor's expansion for K at the origin, and simply use the approximation to degree one, i.e., K(x) ≈ 1 + ax (noting that x ≤ 1 so this is a good approximation), where a = K′(0). Then
\[ s_K(x) = \frac{1 + 1 + 2ax - 2(1 + ax)^2}{2a^2 x^2} = -1 - \frac{1}{ax}. \]
So if a = K′(0) ≠ 0, then for sufficiently small ρ and |x| ≥ ρ, $s_K(\rho) \ge \Omega(1/\rho)$. This also implies the claim in Theorem 1.1 for Laplacian kernels (even though one needs to slightly modify this analysis since, strictly speaking, K′ is not well defined at 0 for Laplacian kernels). As a sanity check, for shift-invariant kernels that are analytic at the origin (which include Gaussian kernels), it is necessary that K′(0) = 0.
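The contrast drawn in Remark 4.1 can be checked numerically from the definition of $s_K$. The sketch below evaluates $s_K(\rho)$ in dimension one for the Laplacian kernel exp(−|x|) and for the Gaussian kernel exp(−x²/2); the function name `s_K` and the chosen values of ρ are illustrative assumptions. For the Laplacian kernel the formula simplifies to $(1 - e^{-2\rho}) / (2(1 - e^{-\rho})^2) \approx 1/\rho$, while for this Gaussian kernel it equals $(1 - e^{-\rho^2})^2 / (2(1 - e^{-\rho^2/2})^2)$, which tends to 2 as ρ → 0.

```python
import numpy as np

def s_K(K, x):
    """s_K(x) = (1 + K(2x) - 2 K(x)^2) / (2 (1 - K(x))^2), as in Theorem 4.1."""
    return (1.0 + K(2.0 * x) - 2.0 * K(x) ** 2) / (2.0 * (1.0 - K(x)) ** 2)

def laplacian(x):
    return np.exp(-np.abs(x))

def gaussian(x):
    return np.exp(-0.5 * x ** 2)

# Not smaller than 1e-3: the direct formula suffers from floating-point
# cancellation for much smaller arguments.
rhos = np.array([1e-1, 1e-2, 1e-3])
s_lap = s_K(laplacian, rhos)    # grows like 1 / rho
s_gauss = s_K(gaussian, rhos)   # stays bounded, close to 2

for rho, a, b in zip(rhos, s_lap, s_gauss):
    print(f"rho={rho:g}  Laplacian s_K={a:.3f}  Gaussian s_K={b:.3f}")
```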

5. BEYOND RFF: OBLIVIOUS EMBEDDING FOR LAPLACIAN KERNEL

In this section we provide a proof sketch for Theorem 1.4. A more detailed proof is deferred to Section F.

Embedding. To handle a Laplacian kernel function $K(x, y) = e^{-\|x - y\|_1 / c}$ with some constant c, we cannot directly use the RFF mapping ϕ, since our lower bound shows that the output dimension has to be very large when K is not analytical around the origin. To overcome this issue, we come up with the following idea. Notice that K(x, y) relies on the $\ell_1$-distance between x and y. If one can embed (via an embedding function f) the data points from the original $\ell_1$ metric space into a new metric space, and ensure that there is a kernel function K′ for the new space, analytical around the origin, such that K(x, y) = K′(f(x), f(y)) for every pair of original data points x, y, then one can use the function composition ϕ ∘ f to get a desired mapping.

Indeed, we find that $\ell_1$ can be embedded into $\ell_2^2$ isometrically (Kahane, 1981) in the following way. Here, for simplicity of exposition, we only handle the case where input data are from $\mathbb{N}^d$, with every coordinate upper bounded by a natural number N. Notice that even though input data points consist only of integers, the mapping construction needs to handle fractions, as we will later consider numbers generated from Gaussian distributions or numbers computed in the RFF mapping. So we first set up two numbers, $\Delta' = \mathrm{poly}(N, \delta^{-1})$ large enough and $\rho' = 1/\mathrm{poly}(N, \delta^{-1})$ small enough. All our following operations work on numbers that are (∆′, ρ′)-bounded.

For each dimension we do the following transformation. Let $\pi_1 : \mathbb{N} \to \mathbb{R}^N$ be such that for every $x \in \mathbb{N}$, x ≤ N, the first x entries of $\pi_1(x)$ are 1, while all the remaining entries are 0. Then consider all d dimensions. Let the embedding function $\pi_1^{(d)} : \mathbb{N}^d \to \mathbb{R}^{Nd}$ be such that for every $x \in \mathbb{N}^d$ with $x_i \le N$ for all $i \in [d]$, $\pi_1^{(d)}(x)$ is the concatenation of the d vectors $\pi_1(x_i)$, $i \in [d]$.
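The embedding just defined can be sketched in a few lines. The code below is a minimal illustration, with hypothetical function names `unary` and `unary_embed`, that builds $\pi_1$ and $\pi_1^{(d)}$ explicitly and checks on random integer points that the squared $\ell_2$ distance of the images equals the $\ell_1$ distance of the inputs. It materializes all Nd coordinates, so it is only meant for small N.

```python
import numpy as np

def unary(x, N):
    """pi_1(x): a length-N vector whose first x entries are 1 and the rest 0."""
    v = np.zeros(N)
    v[:x] = 1.0
    return v

def unary_embed(x, N):
    """pi_1^(d)(x): concatenation of pi_1(x_i) over all coordinates of x."""
    return np.concatenate([unary(int(xi), N) for xi in x])

rng = np.random.default_rng(0)
N, d = 50, 4
x = rng.integers(0, N + 1, size=d)
y = rng.integers(0, N + 1, size=d)

l1 = int(np.sum(np.abs(x - y)))                                   # ||x - y||_1
l2_squared = float(np.sum((unary_embed(x, N) - unary_embed(y, N)) ** 2))
print(l1, l2_squared)
```

The two printed values coincide because, in each coordinate, the images of $x_i$ and $y_i$ differ in exactly $|x_i - y_i|$ entries, each by 1.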
After embedding, consider a new kernel function $K'(x', y') = e^{-\|x' - y'\|_2^2 / c}$, where $x' = \pi_1^{(d)}(x)$, $y' = \pi_1^{(d)}(y)$. One can see immediately that K′(x′, y′) = K(x, y). Hence, we can then apply RFF, i.e., the mapping is $\phi \circ \pi_1^{(d)}$, which has a small output dimension. Detailed proofs can be found in Section F.1.

However, there is another issue. In our setting, if the data is (∆, ρ)-bounded, then we have to pick $N = O(\frac{\Delta}{\rho})$. The computing time has a linear factor in N, which is too large.

Polynomial Time Construction. To reduce the computing time, we start from the following observation about the RFF mapping ϕ(x′). Each output dimension is actually a function of ⟨ω, x′⟩, where ω is a vector of i.i.d. Gaussian random variables. For simplicity of description, we only consider the case where x has only one dimension and $x' = \pi_1(x)$. So x′ is just a vector consisting of x ones starting from the left, with all the remaining entries being 0. Notice that a sum of Gaussian random variables is still Gaussian. So given x, one can generate ⟨ω, x′⟩ as a sum of Gaussians. But here comes another problem. For two data points x, y, we need to use the same ω. If we generate ⟨ω, x′⟩ and ⟨ω′, y′⟩ separately, then ω, ω′ are independent.

To bypass this issue, first consider the following alternate way to generate ⟨ω, x′⟩. Let h be the smallest integer such that $N \le 2^h$. Consider a binary tree of depth h in which each internal node has exactly 2 children, so it has exactly $2^h$ leaf nodes in the last layer. For each node v, we attach a random variable $\alpha_v$ in the following way. For the root, we attach a Gaussian variable which is the sum of $2^h$ independent Gaussian variables, each with distribution $\omega_0$. Then we proceed layer by layer from the root to the leaves. For each pair u, v of children of a common parent w, assume that $\alpha_w$ is the sum of $2^l$ independent variables with distribution $\omega_0$.
Then let $\alpha_u$ be the sum of the first $2^{l-1}$ of these variables and $\alpha_v$ be the sum of the remaining $2^{l-1}$. That is, $\alpha_w = \alpha_u + \alpha_v$ with $\alpha_u, \alpha_v$ being independent. Notice that conditioned on $\alpha_w = a$, $\alpha_u$ takes the value b with probability $\Pr_{\alpha_u, \alpha_v \text{ i.i.d.}}[\alpha_u = b \mid \alpha_u + \alpha_v = a]$, and $\alpha_v$ takes the value a − b when $\alpha_u$ takes the value b. The randomness for generating the random variable of every node is presented as a sequence, in the order from root to leaves, layer by layer, from left to right. We define $\alpha_x$ to be the sum of the random variables corresponding to the first x leaves.

Notice that $\alpha_x$ can be sampled efficiently in the following way. Consider the path from the root to the x-th leaf. First we sample the root, which can be computed using the corresponding part of the randomness. We use a variable z to record this sample outcome, calling z an accumulator for convenience. Then we visit each node along the path. When visiting v, assume its parent is w, where $\alpha_w$ has already been sampled with outcome a. If v is a left child of w, then we sample $\alpha_v$ conditioned on $\alpha_w = a$. Assume this sampling has outcome b. Then we add −a + b to the current accumulator z. If v is a right child of w, then we keep the current accumulator z unchanged. After visiting all nodes on the path, z is the sample outcome for $\alpha_x$. We can show that the joint distribution of $(\alpha_x, \alpha_y)$ is basically the same as that of $(\langle \omega, \pi_1(x) \rangle, \langle \omega, \pi_1(y) \rangle)$; see Lemma F.2. The advantage of this alternate construction is that given any x, to generate $\alpha_x$ one only needs to visit the path from the root to the x-th leaf, using the above generating procedure.

To finally reduce the time complexity, the last issue is that the uniform random string for generating the random variables here is very long. If we sweep the random tape to locate the randomness used to generate the variable of a given node, then we still need time linear in N.
Fortunately, PRGs for space-bounded computation, e.g., Nisan's PRG (Nisan, 1992), can be used here to replace the uniform randomness, because the whole procedure for deciding whether $\|\phi \circ \pi_1(x) - \phi \circ \pi_1(y)\|_2$ approximates $\sqrt{2 - 2K(x, y)}$ within (1 ± ε) multiplicative error runs in poly-logarithmic space. Also, the computation of such PRGs can be highly efficient, i.e., given any index of its output, one can compute that bit in time polynomial in the seed length, which is poly-logarithmic in N. Hence the computing time of the mapping only has a factor poly-logarithmic in N instead of a factor linear in N.

We have now shown our construction for the case where all input data points are from $\mathbb{N}$. One can generalize this to the case where all numbers are (∆, ρ)-bounded, by doing some simple rounding and shifting of numbers. This can then be further generalized to the case where the input data has d dimensions, by simply handling each dimension separately and concatenating the results. More details of this part are deferred to Section F.4.
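The tree-based sampling of $\alpha_x$ described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction analyzed in Section F: each leaf variable is taken to be N(0, σ²), so that a node with m leaves below it is N(0, mσ²) and, conditioned on a parent value a, its left child is N(a/2, mσ²/2) where m is the number of leaves under the child; and instead of Nisan's PRG, a NumPy generator seeded by the node's (level, index) stands in for "the part of the random tape belonging to that node". The names `node_rng` and `prefix_sum` are hypothetical.

```python
import numpy as np

def node_rng(seed, level, index):
    """Deterministic per-node randomness (a stand-in for indexing a shared tape)."""
    return np.random.default_rng([seed, level, index])

def prefix_sum(x, h, seed, sigma=1.0):
    """Sample alpha_x, the sum of the first x of 2**h i.i.d. N(0, sigma^2) leaves,
    by walking only the root-to-leaf path of the x-th leaf (h + 1 nodes)."""
    assert 0 <= x <= 2 ** h
    if x == 0:
        return 0.0
    # Root: sum of all 2**h leaves.
    a = node_rng(seed, 0, 0).normal(0.0, sigma * np.sqrt(2 ** h))
    z = a          # accumulator
    index = 0      # position of the current node within its level
    for level in range(1, h + 1):
        m = 2 ** (h - level)  # number of leaves below each child
        # Left child conditioned on parent value a: N(a / 2, m * sigma^2 / 2).
        left = node_rng(seed, level, 2 * index).normal(a / 2.0, sigma * np.sqrt(m / 2.0))
        goes_right = ((x - 1) >> (h - level)) & 1
        if goes_right:
            a = a - left          # value of the right child; accumulator unchanged
            index = 2 * index + 1
        else:
            z += left - a         # replace the parent's value by the left child's
            a = left
            index = 2 * index
    return z

h, x, y = 4, 3, 11
diffs = np.array([prefix_sum(y, h, s) - prefix_sum(x, h, s) for s in range(2000)])
print(diffs.var())  # should be close to (y - x) * sigma^2 = 8
```

Because the randomness of a node depends only on the seed and the node's position, two calls with the same seed but different x share every node on the common part of their paths, which is what makes $\alpha_x$ and $\alpha_y$ consistent with a single underlying ω.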

6. EXPERIMENTS

We evaluate the empirical relative error of our methods on a simulated dataset. Specifically, we do two experiments, one to evaluate RFF on a Gaussian kernel, and the other one to evaluate the new algorithm in Theorem F.1, which we call "New-Lap", on a Laplacian kernel. In each experiment, we compare against two other popular methods, namely the Nyström and random-projection methods.

Figure 1: The dimension-error tradeoff curves for both experiments, i.e., the experiment that evaluates RFF and the one that evaluates New-Lap.

Baselines. Observe that there are many possible implementations of these two methods. However, since we focus on the accuracy evaluation, we choose computationally heavy but more accurate implementations as the two baselines (hence the evaluation of the error is only in the baselines' favor). In particular, we consider 1) SVD low-rank approximation, which we call "SVD", and 2) the vanilla Johnson-Lindenstrauss algorithm performed on top of the high-dimensional representation of points in the feature space, which we call "JL". Note that SVD is the "ideal" goal/form of Nyström methods and that Johnson-Lindenstrauss applied on the feature space can obtain a theoretically tight target-dimension bound (in the worst-case sense).

Experiment Setup. Both experiments are conducted on a synthesized dataset X which consists of N = 100 points with d = 60 dimensions generated i.i.d. from a Gaussian distribution. For the experiment that evaluates RFF, we use a Gaussian kernel $K(x) = \exp(-0.5 \cdot \|x\|^2)$, and for the one that evaluates New-Lap, we use a Laplacian kernel $K(x) = \exp(-0.5 \cdot \|x\|_1)$. In each experiment, for each method, we run it for varying target dimension D (for SVD, D is the target rank), and we report its empirical relative error, which is defined as
\[ \max_{x \neq y \in X} \frac{|d'(x, y) - d_K(x, y)|}{d_K(x, y)}, \]
where $d_K$ is the kernel distance and d′ is the approximated distance.
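The error measure above can be sketched for the RFF experiment as follows. This is a minimal illustration, not the code used for Figure 1: the scale of the Gaussian data (`scale = 0.1`), the list of target dimensions, the number of repetitions and the helper names are all assumptions, and the baselines (SVD, JL) and New-Lap are not included.

```python
import numpy as np

def kernel_distances(X):
    """Exact kernel distances sqrt(2 - 2K(x - y)) for K(x) = exp(-0.5 ||x||^2)."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sqrt(2.0 - 2.0 * np.exp(-0.5 * sq_dists))

def rff_distances(X, D, rng):
    """Pairwise distances between RFF images, using D sampled frequencies."""
    omegas = rng.standard_normal((D, X.shape[1]))
    proj = X @ omegas.T
    Z = np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)
    # Every row of Z has unit norm, so ||z_a - z_b||^2 = 2 - 2 <z_a, z_b>.
    return np.sqrt(np.maximum(2.0 - 2.0 * Z @ Z.T, 0.0))

def max_relative_error(X, D, rng):
    """max over pairs x != y of |d'(x, y) - d_K(x, y)| / d_K(x, y)."""
    iu = np.triu_indices(X.shape[0], k=1)
    exact = kernel_distances(X)[iu]
    approx = rff_distances(X, D, rng)[iu]
    return float(np.max(np.abs(approx - exact) / exact))

rng = np.random.default_rng(0)
scale = 0.1                                   # assumed data scale
X = scale * rng.standard_normal((100, 60))    # N = 100 points, d = 60
errors = {}
for D in (20, 200, 2000):
    runs = [max_relative_error(X, D, rng) for _ in range(5)]
    errors[D] = float(np.mean(runs))
    print(D, errors[D])
```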
To stabilize the results, we repeat the entire experiment T = 20 times for every D and report the average and the 95% confidence interval. We plot these dimension-error tradeoff curves and depict the results in Figure 1.

Results. We conclude that in both experiments, our methods can indeed preserve the kernel distance well in relative error, which is consistent with our theorems. In particular, the dimension-error curve is comparable to (and even slightly better than) that of the computationally heavy Johnson-Lindenstrauss algorithm (which is theoretically optimal in the worst case). On the contrary, the popular Nyström (low-rank approximation) method is largely incapable of preserving the relative error of the kernel distance. In fact, we observe that $d'_{\mathrm{SVD}}(x, y) = 0$ or ≈ 0 often happens for some pairs (x, y) such that d(x, y) ≠ 0, which explains the high relative error. This indicates that our methods can indeed preserve the kernel distance well in relative error, whereas existing methods struggle to achieve this.

We prove this by induction. In the case of k = 0 the lemma holds obviously. If the lemma holds for k, we have
\begin{align*}
\cos^{k+1}(\langle \omega_i, x \rangle) &= \cos(\langle \omega_i, x \rangle) \cdot \frac{1}{2^k} \sum_{j=0}^{k} \binom{k}{j} \cos((2j - k) \langle \omega_i, x \rangle) \\
&= \frac{1}{2^k} \sum_{j=0}^{k} \binom{k}{j} \cos(\langle \omega_i, x \rangle) \cos((2j - k) \langle \omega_i, x \rangle) \\
&= \frac{1}{2^{k+1}} \sum_{j=0}^{k} \binom{k}{j} \left[ \cos((2j - k + 1) \langle \omega_i, x \rangle) + \cos((2j - k - 1) \langle \omega_i, x \rangle) \right] \\
&= \frac{1}{2^{k+1}} \sum_{j=0}^{k} \binom{k}{j} \left[ \cos((2(j + 1) - (k + 1)) \langle \omega_i, x \rangle) + \cos((2j - (k + 1)) \langle \omega_i, x \rangle) \right] \\
&= \frac{1}{2^{k+1}} \sum_{j=0}^{k+1} \left[ \binom{k}{j} + \binom{k}{j - 1} \right] \cos((2j - (k + 1)) \langle \omega_i, x \rangle) \\
&= \frac{1}{2^{k+1}} \sum_{j=0}^{k+1} \binom{k + 1}{j} \cos((2j - (k + 1)) \langle \omega_i, x \rangle),
\end{align*}
where in the third equality we use the fact that 2 cos α cos β = cos(α + β) + cos(α − β).

Lemma A.2. If there exists r > 0 such that K is analytic in $\{x \in \mathbb{R}^d : \|x\|_1 < r\}$, then for all k ≥ 0, $\lim_{x \to 0} \frac{\mathbb{E}[X_i(x)^k]}{\|x\|_1^{2k}} = c$ for some constant c.

Proof. We write the analytic function K(x) as a Taylor series around the origin, $K(x) = \sum_{\beta \in \mathbb{N}^d} c_\beta x^\beta$, where $x^\beta := \prod_{i=1}^{d} x_i^{\beta_i}$ is a monomial and $c_\beta$ is its coefficient. By definition, $c_0 = 1$ since K(0) = 1.
We let $s : \mathbb{N}^d \to \mathbb{N}$, $s(\beta) := \sum_{i=1}^{d}\beta_i$, denote the degree of $x^\beta$. Since $K(x-y) = K(x, y) = K(y, x) = K(y-x)$ by definition, $K(x)$ is an even function, so $c_\beta = 0$ for $s(\beta)$ odd. Recall that $X_i(x) := \cos\langle\omega_i, x\rangle - K(x)$. In the following, we drop the subscripts in $X_i, \omega_i$ and write $X, \omega$ for simplicity. By the definition of $X$ we have
$$\mathbb{E}[X(x)^k] = \sum_{i=0}^{k}\binom{k}{i}K(x)^i(-1)^{k-i}\,\mathbb{E}[\cos^{k-i}\langle\omega, x\rangle]. \tag{4}$$
Note that by Lemma A.1, $\mathbb{E}[\cos^{k-i}\langle\omega, x\rangle] = \frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x)$. Plugging this into eq. (4):
$$\begin{aligned}
\mathbb{E}[X(x)^k]
&= \sum_{i=0}^{k}\binom{k}{i}K(x)^i(-1)^{k-i}\frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x) \\
&= \sum_{i=0}^{k}\binom{k}{i}K(x)^i\left(-\frac{1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}\sum_{\beta\in\mathbb{N}^d}c_\beta(2j-k+i)^{s(\beta)}x^\beta \\
&= \sum_{\beta\in\mathbb{N}^d}c_\beta x^\beta\sum_{i=0}^{k}\binom{k}{i}K(x)^i\left(-\frac{1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}(2j-k+i)^{s(\beta)},
\end{aligned}$$
where the second equality comes from the Taylor expansion of $K((2j-k+i)x)$.

Next we show that $\mathbb{E}[X(x)^k]$ is of degree at least $2k$. For $\beta = 0$, note that $\sum_{i=0}^{k}\binom{k}{i}K(x)^i(-1)^{k-i} = (K(x)-1)^k$; since $K(x)$ is even and $K(0) = 1$, we have $\lim_{x\to 0}\frac{(K(x)-1)^k}{\|x\|_1^{t}} = 0$ for all $t < 2k$.

For $\beta \neq 0$, we show that every term of degree less than $2k$ has coefficient zero. Fix $\beta \neq 0$ and take the Taylor expansion of $K(x)^i$:
$$K(x)^i = \sum_{\beta_1,\beta_2,\ldots,\beta_i} c_{\beta_1}c_{\beta_2}\cdots c_{\beta_i}\,x^{\beta_1+\cdots+\beta_i}.$$
Without loss of generality, we assume $\beta_{l+1},\ldots,\beta_i$ are all the $\beta$s that equal $0$, so we have $c_{\beta_{l+1}} = \cdots = c_{\beta_i} = 1$. Now we consider the coefficient of the term $c_\beta c_{\beta_1}c_{\beta_2}\cdots c_{\beta_l}\,x^{\beta+\sum_{j=1}^{l}\beta_j}$, which is
$$C\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(-\frac{1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}(2j-k+i)^{s(\beta)},$$
where $C$ is the number of ordered sequences $(\beta_1,\beta_2,\ldots,\beta_l)$; here, for $\beta_1 = \beta_2$, the sequences $(\beta_1,\beta_2,\ldots,\beta_l)$ and $(\beta_2,\beta_1,\ldots,\beta_l)$ are equivalent.

Next, we show that if the degree of a monomial satisfies $s(\beta)+\sum_{j=1}^{l}s(\beta_j) < 2k$, then its coefficient is zero. Since all $\beta_j \neq 0$, we may assume $s(\beta_j) \geq 2$, therefore $s(\beta) < 2k-2l$. Suppose the operator $J$ is a mapping from a function space to itself, such that for all $f : \mathbb{R}\to\mathbb{R}$, $J(f) : \mathbb{R}\to\mathbb{R}$ is defined by $J(f)(x) := f(x+1)$.
Denote $J^1 = J$ and $J^k := J\circ J^{k-1}$ as its $k$-time composition, and define $J^0$ to be the identity mapping such that $J^0(f) = f$. Similarly we can define addition by $(J_1+J_2)(f) = J_1(f)+J_2(f)$ and scalar multiplication by $(\alpha J)(f) = \alpha(J(f))$. By definition, $cJ^m\circ J^n = J^m\circ(cJ^n) = cJ^{m+n}$ for all $c\in\mathbb{R}$, $m, n\in\mathbb{N}$. Let $L(x) = x^{s(\beta)}$; the coefficient can then be rewritten as
$$C\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(-\frac{1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}J^{2j+i}(L)(-k).$$
Let $P = C\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(-\frac{1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}J^{2j+i}(L)$; the above is $P(-k)$. Now we show $P\equiv 0$:
$$\begin{aligned}
P &= C\left[\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(-\frac{1}{2}\right)^{k-i}J^i\circ\left(\sum_{j=0}^{k-i}\binom{k-i}{j}J^{2j}\right)\right](L) \\
&= C\sum_{i=l}^{k}\binom{k}{l}\binom{k-l}{i-l}J^i\circ\left(-\frac{J^0+J^2}{2}\right)^{k-i}(L) \\
&= C\,J^l\circ\binom{k}{l}\sum_{i=l}^{k}\binom{k-l}{i-l}J^{i-l}\circ\left(-\frac{J^0+J^2}{2}\right)^{k-i}(L) \\
&= C\,J^l\circ\binom{k}{l}\left(J-\frac{J^0+J^2}{2}\right)^{k-l}(L).
\end{aligned}$$
Note that $\left(J-\frac{J^0+J^2}{2}\right)(f)(x) = (f(x+1)-f(x))/2-(f(x+2)-f(x+1))/2$ calculates a second-order difference. Namely, for every polynomial $f$ of degree $k\geq 2$, $\left(J-\frac{J^0+J^2}{2}\right)(f)$ is a polynomial of degree $k-2$, and for every polynomial $f$ of degree $k<2$, $\left(J-\frac{J^0+J^2}{2}\right)(f)$ is $0$. Since $L$ is a polynomial of degree less than $2(k-l)$, we have $\left(J-\frac{J^0+J^2}{2}\right)^{k-l}(L)\equiv 0$.

Combining the above two cases, we have proved that eq. (4) is of degree at least $2k$, which completes our proof.

Proof of Lemma 3.1. If $2k\|x\|_1\geq r$, since $|X_i(x)| = |\cos\langle\omega_i, x\rangle-K(x)|\leq 2$, we have $\mathbb{E}[X_i(x)^k]\leq 2^k\leq\left(\frac{4k\|x\|_1}{r}\right)^{2k}$. Otherwise $\|x\|_1 < r/2k$. Define $g_k(x) := \mathbb{E}[X_i(x)^k]$; we have
$$g_k(x) = \sum_{i=0}^{\infty}\frac{\left(\sum_{j=1}^{d}x_j\frac{\partial}{\partial x_j}\right)^{i}g_k(0)}{i!} = \frac{\left(\sum_{j=1}^{d}x_j\frac{\partial}{\partial x_j}\right)^{2k}g_k(\theta x)}{(2k)!},\quad\theta\in[0,1],$$
where the second equation comes from Lemma A.2 and the Taylor expansion with Lagrange remainder.

Lemma A.3 (Cauchy's integral formula for multivariate functions, Hörmander 1966). For $f(z_1,\ldots,z_d)$ analytic in $\Delta(z, r) = \{\zeta = (\zeta_1,\zeta_2,\ldots,\zeta_d)\in\mathbb{C}^d;\ |\zeta_\nu-z_\nu|\leq r_\nu,\ \nu = 1,\ldots,d\}$,
$$f(z_1,\ldots,z_d) = \frac{1}{(2\pi i)^d}\int_{\partial D_1\times\partial D_2\times\cdots\times\partial D_d}\frac{f(\zeta_1,\ldots,\zeta_d)}{(\zeta_1-z_1)\cdots(\zeta_d-z_d)}\,d\zeta.$$
Furthermore,
$$\frac{\partial^{k_1+\cdots+k_d}f(z_1, z_2,\ldots,z_d)}{\partial z_1^{k_1}\cdots\partial z_d^{k_d}} = \frac{k_1!\cdots k_d!}{(2\pi i)^d}\int_{\partial D_1\times\partial D_2\times\cdots\times\partial D_d}\frac{f(\zeta_1,\ldots,\zeta_d)}{(\zeta_1-z_1)^{k_1+1}\cdots(\zeta_d-z_d)^{k_d+1}}\,d\zeta.$$
If in addition $|f| < M$, we have the following estimate:
$$\left|\frac{\partial^{k_1+\cdots+k_d}f(z_1, z_2,\ldots,z_d)}{\partial z_1^{k_1}\cdots\partial z_d^{k_d}}\right|\leq\frac{M\,k_1!\cdots k_d!}{r_1^{k_1}\cdots r_d^{k_d}}.$$

Recall that $g_k(x) = \mathbb{E}[X_i(x)^k] = \sum_{i=0}^{k}\binom{k}{i}K(x)^i(-1)^{k-i}\frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x)$, so $g_k(x) = \mathrm{poly}(K(x), K(-x),\ldots,K(kx), K(-kx))$ is analytic when $\|x\|_1\leq r/k$. Applying Cauchy's integral formula, Lemma A.3 (here $\|z+\theta x\|_1\leq 2\cdot r/2k$ is in the analytic area),
$$\begin{aligned}
g_k(x) &= \sum_{t_1+\cdots+t_d = 2k}\frac{x_1^{t_1}x_2^{t_2}\cdots x_d^{t_d}}{t_1!\,t_2!\cdots t_d!}\cdot\frac{\partial^{2k}g_k(\theta x)}{\partial x_1^{t_1}\partial x_2^{t_2}\cdots\partial x_d^{t_d}} \\
&= \sum_{t_1+\cdots+t_d = 2k}\frac{x_1^{t_1}x_2^{t_2}\cdots x_d^{t_d}}{(2\pi i)^d}\int_{z\in\mathbb{C}^d,\ |z_i| = \frac{r}{2k}}\frac{g_k(z+\theta x)}{z_1^{t_1+1}\cdots z_d^{t_d+1}}\,dz,
\end{aligned}$$
and therefore
$$|g_k(x)|\leq\sup_{|z_i| = r/2k}|g_k(z+\theta x)|\cdot\left(\frac{2k}{r}\right)^{2k}\left(\sum_{i=1}^{d}|x_i|\right)^{2k}\leq\left(\frac{4k\|x\|_1}{r}\right)^{2k}.$$

B PROOF OF LEMMA 3.2

Lemma 3.2. For a kernel $K$ which is shift-invariant and analytic at the origin, there exist $c_K, R_K > 0$ such that for all $\|x\|_1\leq R_K$, $\frac{1-K(x)}{\|x\|_1^2}\geq\frac{c_K}{2}$.

Proof. It suffices to prove that $\liminf_{x\to 0}\frac{1-K(x)}{\|x\|_1^2}\geq c > 0$ for some $c$. Towards proving this, we show that $1-K$ is strongly convex at the origin. In fact, by definition, $\Re\int_{\mathbb{R}^d}p(\omega)e^{i\langle\omega, tx\rangle}\,d\omega = K(tx)$ for every fixed $x$, therefore $-K''(tx) = \Re\int_{\mathbb{R}^d}\|\omega\|^2p(\omega)e^{i\langle\omega, tx\rangle}\,d\omega > 0$ at $t = 0$; hence $1-K(tx)$ is strongly convex with respect to $t$ at the origin, and so is $1-K(x)$.

C PROOF OF THEOREM 1.2

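Theorem 1.2 (Upper bound). Let $K : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ be a kernel function which is shift-invariant and analytic at the origin, with feature mapping $\varphi : \mathbb{R}^d\to\mathcal{H}$ for some feature space $\mathcal{H}$. For every $0 < \delta\leq\varepsilon\leq 2^{-16}$ and every $d, D\in\mathbb{N}$ with $D\geq\max\{\Theta(\varepsilon^{-1}\log^3(1/\delta)),\ \Theta(\varepsilon^{-2}\log(1/\delta))\}$, if $\pi : \mathbb{R}^d\to\mathbb{R}^D$ is an RFF mapping for $K$ with target dimension $D$, then for every $x, y\in\mathbb{R}^d$,
$$\Pr[|\mathrm{dist}_\pi(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]\geq 1-\delta.$$

Proof. When $\|x-y\|_1\geq R_K$, consider the function $g(t) = K(t(x-y))$. It follows from the definition that $g'(0) = 0$ and $g''(t) = -\Re\int_{\mathbb{R}^d}\|\omega\|^2\|x-y\|^2p(\omega)e^{i\langle\omega, t(x-y)\rangle}\,d\omega < 0$, so $g(t)$ strictly decreases for all $t > 0$. So $2-2K(x-y)\geq 2-2\max_{\|x-y\|_1 = R_K}K(x-y) > 0$. We denote $t = 2-2\max_{\|x-y\|_1 = R_K}K(x-y)$, so by the Chernoff bound, when $D\geq\frac{1}{2t}\left(\ln\frac{1}{\delta}+\ln 2\right)$, we have
$$\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_i(x-y)\right|\leq\varepsilon\cdot(2-2K(x-y))\right]\geq\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_i(x-y)\right|\leq t\right]\geq 1-\delta.$$

When $\|x-y\|_1\leq R_K$, by Lemma 3.2, we have $2-2K(x-y)\geq c\|x-y\|_1^2$. Then we have
$$\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_i(x-y)\right|\leq\varepsilon\cdot(2-2K(x-y))\right]\geq\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_i(x-y)\right|\leq c\varepsilon\cdot\|x-y\|_1^2\right].$$
We take $r = r_K$ for simplicity of exposition. Assume $\delta\leq\min\{\varepsilon, 2^{-16}\}$, and let $k = \log(2D^2/\delta)$ be even and $t = 64k^2/r^2$. By Markov's inequality and Lemma 3.1:
$$\Pr[|X_i(x-y)|\geq t\|x-y\|_1^2] = \Pr\left[|X_i(x-y)|^k\geq t^k\|x-y\|_1^{2k}\right]\leq\frac{(4k)^{2k}}{t^kr^{2k}} = 4^{-k}\leq\frac{\delta}{2D^2}.$$
For simplicity, denote $X_i(x-y)$ by $X_i$ and $\|x-y\|_1^2$ by $\ell$, and define $X_i' = \mathbb{1}_{[|X_i|\geq t\ell]}\,t\ell\cdot\mathrm{sgn}(X_i)+\mathbb{1}_{[|X_i| < t\ell]}\,X_i$. Note that $\mathbb{E}[X_i] = 0$.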
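Then:
$$\begin{aligned}
|\mathbb{E}[X_i']|
&\leq\big|\mathbb{E}[X_i'\mid |X_i'| < t\ell]\big|\cdot\Pr[|X_i'| < t\ell]+t\ell\cdot\big|\Pr[X_i'\geq t\ell]-\Pr[X_i'\leq -t\ell]\big| \\
&= \big|\mathbb{E}[X_i'\mid |X_i'| < t\ell]\big|\cdot\Pr[|X_i| < t\ell]+t\ell\cdot\big|\Pr[X_i\geq t\ell]-\Pr[X_i\leq -t\ell]\big| \\
&= \frac{\big|\mathbb{E}[X_i]-\mathbb{E}[X_i\mid |X_i|\geq t\ell]\Pr[|X_i|\geq t\ell]\big|}{\Pr[|X_i| < t\ell]}\Pr[|X_i| < t\ell]+t\ell\cdot\big|\Pr[X_i\geq t\ell]-\Pr[X_i\leq -t\ell]\big| \\
&= \big|\mathbb{E}[X_i]-\mathbb{E}[X_i\mid |X_i|\geq t\ell]\Pr[|X_i|\geq t\ell]\big|+t\ell\cdot\big|\Pr[X_i\geq t\ell]-\Pr[X_i\leq -t\ell]\big| \\
&= \big|\mathbb{E}[X_i\mid |X_i|\geq t\ell]\Pr[|X_i|\geq t\ell]\big|+t\ell\cdot\big|\Pr[X_i\geq t\ell]-\Pr[X_i\leq -t\ell]\big|,
\end{aligned}$$
where $t\ell\cdot|\Pr[X_i > t\ell]-\Pr[X_i < -t\ell]|\leq t\ell\cdot\Pr[|X_i| > t\ell]\leq t\ell\delta/(2D^2)$. The first inequality is by considering the two conditions $|X_i| < t\ell$ and $|X_i|\geq t\ell$, then taking a triangle inequality. The first and second equations are by the definition of $X_i, X_i'$. The third equation is a straightforward computation. The last equation is due to $\mathbb{E}[X_i] = 0$.

By Lemma 3.1, for every integer $\alpha$,
$$\Pr\left[|X_i|\geq\alpha\ell/r^2\right] = \Pr\left[|X_i|^{\sqrt{\alpha}/8}\geq(\alpha\ell/r^2)^{\sqrt{\alpha}/8}\right]\leq\frac{\mathbb{E}[|X_i|^{\sqrt{\alpha}/8}]}{(\alpha\ell/r^2)^{\sqrt{\alpha}/8}}\leq 4^{-\sqrt{\alpha}/8}.$$
The first equality is straightforward. The first inequality is by Markov's inequality. The second inequality is by $\mathbb{E}[|X_i|^{\sqrt{\alpha}/8}]\leq\left(\frac{\alpha\ell}{4r^2}\right)^{\sqrt{\alpha}/8}$, which follows from Lemma 3.1 and a rearrangement of parameters, where $r$ is the parameter $r$ in Lemma 3.1. Therefore,
$$\begin{aligned}
\big|\mathbb{E}[X_i\mid |X_i|\geq t\ell]\big|\Pr[|X_i|\geq t\ell]
&\leq\mathbb{E}[|X_i|\mid |X_i|\geq t\ell]\Pr[|X_i|\geq t\ell] \\
&\leq\left(t+\frac{1}{r^2}\right)\ell\cdot\Pr[|X_i|\geq t\ell]+\frac{\ell}{r^2}\sum_{\text{integer }\alpha\geq tr^2+1}\Pr[|X_i|\geq\alpha\ell/r^2] \\
&\leq\left(t+\frac{1}{r^2}\right)\ell\cdot\frac{\delta}{2D^2}+\ell\int_{tr^2}^{\infty}4^{-\sqrt{\alpha}/8}\,d\alpha \\
&\leq\ell\left(\left(t+\frac{1}{r^2}\right)\frac{\delta}{2D^2}+\frac{16}{r^2\ln 4}4^{-tr^2/8}\right).
\end{aligned}$$
The first inequality is by the property of absolute value. The second inequality is because we can divide the event $|X_i|\geq t\ell$ into the events $|X_i|\in[\alpha\ell/r^2, (\alpha+1)\ell/r^2)$ for $\alpha = tr^2, tr^2+1,\ldots$, and when $|X_i|\in[\alpha\ell/r^2, (\alpha+1)\ell/r^2)$ we have $|X_i| < (\alpha+1)\ell/r^2$.

By plugging in the parameters $t$, $\delta$, and $D\geq\max\{\Theta(\varepsilon^{-1}\log^3(1/\delta)),\ \Theta(\varepsilon^{-2}\log(1/\delta))\}$, we have $|\mathbb{E}[X_i']|\leq\delta\ell$. Note that the $\Theta(\cdot)$ in the bound on $D$ hides a constant depending on $r$. Denote by $\sigma'^2$ the variance of $X_i'$. So $\sigma'\leq 64\ell/r^2$ by Lemma 3.1.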
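Lemma C.1 (Bernstein's Inequality). Let $X_1,\ldots,X_D$ be independent zero-mean random variables. Suppose that $|X_i|\leq M$ for all $i$. Then for all positive $t$,
$$\Pr\left[\sum_{i=1}^{D}X_i\geq t\right]\leq\exp\left(-\frac{t^2/2}{Mt/3+\sum_{i=1}^{D}\mathbb{E}[X_i^2]}\right).$$

Applying Bernstein's inequality to the $X_i'$,
$$\Pr\left[\left|\sum_{i=1}^{D}X_i'-D\,\mathbb{E}[X_i']\right|\geq(c\varepsilon\ell/\sigma')D\sigma'\right]\leq\exp\left(-\frac{c^2\varepsilon^2D}{ct\varepsilon+2\sigma'^2/\ell^2}\right)\leq\max\left\{\exp\left(-\frac{c\varepsilon^2D}{2t\varepsilon}\right),\ \exp\left(-\frac{c^2\varepsilon^2D}{4\sigma'^2/\ell^2}\right)\right\}.$$
Since $D\geq\max\{\Theta(t\varepsilon^{-1}\log(1/\delta)),\ \Theta(\varepsilon^{-2}\log(1/\delta))\}$, we have
$$\Pr\left[\left|\sum_{i=1}^{D}X_i'\right|\geq\varepsilon(\ell/\sigma')D\sigma'\right]\leq\delta/2.$$
With probability $1-\frac{\delta}{2}$, every $|X_i|\leq t\ell$, and hence $X_i' = X_i$. Therefore,
$$\Pr\left[\left|\sum_{i=1}^{D}X_i\right|\geq D(\delta+\varepsilon)\ell\right]\leq\delta/2.$$
Combining the above,
$$\Pr[|\mathrm{dist}_\pi(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]\geq 1-\delta.$$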

D PROOF OF THEOREM 3.1

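Theorem 3.1 (Generalization of Makarychev et al. 2019, Theorem 3.6). For the kernel $k$-clustering problem with $\ell_2^p$-objective whose kernel function $K : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is shift-invariant and analytic at the origin, for every data set $P\subset\mathbb{R}^d$, the RFF mapping $\pi : \mathbb{R}^d\to\mathbb{R}^D$ with target dimension $D = \Omega\left(\left(p^2\log^3\frac{k}{\alpha}+p^5\log^3\frac{1}{\varepsilon}+p^8\right)/\varepsilon^2\right)$ satisfies
$$\Pr[\forall\,k\text{-partition }\mathcal{C}\text{ of }P : \mathrm{cost}_p^\pi(P,\mathcal{C})\in(1\pm\varepsilon)\cdot\mathrm{cost}_p^\varphi(P,\mathcal{C})]\geq 1-\delta.$$

The proof relies on a key notion of $(\varepsilon,\delta,\rho)$-dimension reduction from Makarychev et al. (2019), which we adapt to our setting/language of kernel distance as follows.

Definition D.1 (Makarychev et al. 2019, Definition 2.1). For $\varepsilon,\delta,\rho > 0$ and a feature mapping $\varphi : \mathbb{R}^d\to\mathcal{H}$ for some Hilbert space $\mathcal{H}$, a random mapping $\pi = \pi_{d,D} : \mathbb{R}^d\to\mathbb{R}^D$ is an $(\varepsilon,\delta,\rho)$-dimension reduction, if
• for every $x, y\in\mathbb{R}^d$, $\frac{1}{1+\varepsilon}\mathrm{dist}_\varphi(x, y)\leq\mathrm{dist}_\pi(x, y)\leq(1+\varepsilon)\,\mathrm{dist}_\varphi(x, y)$ with probability at least $1-\delta$, and
• for every fixed $p\in[1,\infty)$, $\mathbb{E}\left[\mathbb{1}_{\{\mathrm{dist}_\pi(x,y) > (1+\varepsilon)\mathrm{dist}_\varphi(x,y)\}}\left(\frac{\mathrm{dist}_\pi(x,y)^p}{\mathrm{dist}_\varphi(x,y)^p}-(1+\varepsilon)^p\right)\right]\leq\rho$.

In Makarychev et al. (2019), most results are stated for a particular parameter setup of Definition D.1 resulting from the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984), but their analysis actually works for other similar parameter setups. The following is a generalized statement of (Makarychev et al., 2019, Theorem 3.5) which also reveals how alternative parameter setups affect the distortion. We note that this is simply a more precise and detailed statement of (Makarychev et al., 2019, Theorem 3.5), and it follows from exactly the same proof as in Makarychev et al. (2019).

Lemma D.1 (Makarychev et al. 2019, Theorem 3.5). Let $0 < \varepsilon,\alpha < 1$ and $\theta := \min\{\varepsilon^{p+1}3^{-(p+1)(p+2)},\ \alpha\varepsilon^p/(10k(1+\varepsilon)^{4p-1}),\ 1/10^{p+1}\}$. If some $(\varepsilon,\delta,\rho)$-dimension reduction $\pi$ for the feature mapping $\varphi : \mathbb{R}^d\to\mathcal{H}$ of some kernel function satisfies $\delta\leq\min(\theta^7/600,\theta/k)$, $\binom{k}{2}\delta\leq\frac{\alpha}{2}$, and $\rho\leq\theta$, then with probability at least $1-\alpha$, for every partition $\mathcal{C}$ of $P$,
$$\mathrm{cost}_p^\pi(P,\mathcal{C})\leq(1+\varepsilon)^{3p}\,\mathrm{cost}_p^\varphi(P,\mathcal{C}),\qquad(1-\varepsilon)\,\mathrm{cost}_p^\varphi(P,\mathcal{C})\leq(1+\varepsilon)^{3p-1}\,\mathrm{cost}_p^\pi(P,\mathcal{C}).$$

Proof of Theorem 3.1. We verify that, setting $D = \Theta\left(\left(\log^3\frac{k}{\alpha}+p^3\log^3\frac{1}{\varepsilon}+p^6\right)/\varepsilon^2\right)$, the RFF mapping $\pi$ with target dimension $D$ satisfies the conditions in Lemma D.1, namely, it is an $(\varepsilon,\delta,\rho)$-dimension reduction. In fact, Theorem 1.2 already implies that such $\pi$ satisfies, for every $x, y\in\mathbb{R}^d$, $\frac{1}{1+\varepsilon}\mathrm{dist}_\varphi(x, y)\leq\mathrm{dist}_\pi(x, y)\leq(1+\varepsilon)\,\mathrm{dist}_\varphi(x, y)$ with probability at least $1-\delta$, where $\delta = e^{-cf(\varepsilon,D)}$ for some constant $c$, and $f(\varepsilon, D) := \max\{\varepsilon^2D,\ \varepsilon^{1/3}D^{1/3}\}$. For the other part,
$$\begin{aligned}
&\mathbb{E}\left[\mathbb{1}_{\{\mathrm{dist}_\pi(x,y) > (1+\varepsilon)\mathrm{dist}_\varphi(x,y)\}}\left(\frac{\mathrm{dist}_\pi(x, y)^p}{\mathrm{dist}_\varphi(x, y)^p}-(1+\varepsilon)^p\right)\right] \\
&\quad= \int_{\varepsilon}^{\infty}\big((1+t)^p-(1+\varepsilon)^p\big)\,d\left(-\Pr\left[\frac{\mathrm{dist}_\pi(x, y)}{\mathrm{dist}_\varphi(x, y)} > t+1\right]\right) \\
&\quad= \Big[\big(-(1+m)^p+(1+\varepsilon)^p\big)\Pr\left[\frac{\mathrm{dist}_\pi(x, y)}{\mathrm{dist}_\varphi(x, y)} > m+1\right]\Big]_{m=\varepsilon}^{m=+\infty}+\int_{\varepsilon}^{\infty}p(1+t)^{p-1}\Pr\left[\frac{\mathrm{dist}_\pi(x, y)}{\mathrm{dist}_\varphi(x, y)} > t+1\right]dt\quad\text{(integration by parts)} \\
&\quad= \int_{\varepsilon}^{\infty}p(1+t)^{p-1}\Pr\left[\frac{\mathrm{dist}_\pi(x, y)}{\mathrm{dist}_\varphi(x, y)} > t+1\right]dt \\
&\quad\leq\int_{\varepsilon}^{\infty}p(1+t)^{p-1}e^{-cf(t,D)}\,dt,
\end{aligned}$$
where the third equality follows since $\Pr\left[\frac{\mathrm{dist}_\pi(x,y)}{\mathrm{dist}_\varphi(x,y)} > m\right]$ decays exponentially fast with respect to $m$. Observe that for $p\geq 1$ and $D\geq\frac{(p-1)^3}{8c^3}$, $p(1+t)^{p-1}e^{-cD^{1/3}t^{1/3}/2}$ decreases when $t\geq\varepsilon$, and for $D\geq\frac{c(p-1)}{\varepsilon^2}$, $p(1+t)^{p-1}e^{-ct^2D/2}$ decreases when $t\geq\varepsilon$. Hence for $D\geq\max\left\{\frac{(p-1)^3}{8c^3},\ \frac{c(p-1)}{\varepsilon^2}\right\}$, we have
$$\int_{\varepsilon}^{\infty}p(1+t)^{p-1}e^{-cf(t,D)}\,dt\leq c'\int_{\varepsilon}^{\infty}e^{-cf(t,D)/2}\,dt < c''e^{-cf(\varepsilon,D)/2}.$$
In conclusion, by setting $D = \Theta\left(\left(\log^3\frac{k}{\alpha}+p^3\log^3\frac{1}{\varepsilon}+p^6\right)/\varepsilon^2\right)$, for $\delta = e^{-cf(\varepsilon,D)}$, $\rho = c''e^{-cf(\varepsilon,D)}$ and $f(\varepsilon, D) = \max\{\varepsilon^2D,\ \varepsilon^{1/3}D^{1/3}\}$, it satisfies $\delta\leq\min(\theta^7/600,\theta/k)$, $\binom{k}{2}\delta\leq\frac{\alpha}{2}$, and $\rho\leq\theta$. This verifies the condition of Lemma D.1.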
Finally, we conclude the proof of Theorem 3.1 by plugging ε ′ = ε/3p and the above mentioned RFF mapping π with target dimension D into Lemma D.1.

E PROOF OF THEOREM 4.1

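Theorem 4.1. Consider $\Delta\geq\rho > 0$ and a shift-invariant kernel function $K : \mathbb{R}^d\to\mathbb{R}$, denoting its feature mapping by $\varphi : \mathbb{R}^d\to\mathcal{H}$. Then there exist $x, y\in\mathbb{R}^d$ that are $(\Delta,\rho)$-bounded, such that for every $0 < \varepsilon < 1$, the RFF mapping $\pi$ for $K$ with target dimension $D$ satisfies
$$\Pr[|\mathrm{dist}_\varphi(x, y)-\mathrm{dist}_\pi(x, y)|\geq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]\geq\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon\sqrt{D/s_K^{(\Delta,\rho)}}}^{\infty}e^{-s^2/2}\,ds-O\left(D^{-\frac{1}{2}}\right), \tag{2}$$
where $s_K^{(\Delta,\rho)} := \sup_{(\Delta,\rho)\text{-bounded }x\in\mathbb{R}^d}s_K(x)$, and $s_K(x) := \frac{1+K(2x)-2K(x)^2}{2(1-K(x))^2}$.

Proof. Our proof requires the following anti-concentration inequality.

Lemma E.1 (Berry 1941; Esseen 1942). For i.i.d. random variables $\xi_i\in\mathbb{R}$ with mean $0$ and variance $1$, let $X := \frac{1}{\sqrt{D}}\sum_{i=1}^{D}\xi_i$. Then for any $t$,
$$\Pr[X\geq t]\geq\frac{1}{\sqrt{2\pi}}\int_{t}^{\infty}e^{-s^2/2}\,ds-O(D^{-\frac{1}{2}}).$$

Let $X_i(x) := \cos\langle\omega_i, x\rangle-K(x)$ and $\sigma(x) := \sqrt{\mathrm{Var}(X_i(x))} = \sqrt{\frac{1+K(2x)-2K(x)^2}{2}}$, and choose $x, y$ such that $s_K(x-y) = s_K^{(\Delta,\rho)}$. Clearly, such a pair $x, y$ satisfies that $(x-y)$ is $(\Delta,\rho)$-bounded. In fact, it is without loss of generality to assume that both $x$ and $y$ are $(\Delta,\rho)$-bounded, since one may pick $y' = 0$, $x' = x-y$ and still have $x'-y' = x-y$. We next verify that such $x, y$ satisfy our claimed properties. Indeed,
$$\begin{aligned}
\Pr[|\mathrm{dist}_\varphi(x, y)-\mathrm{dist}_\pi(x, y)|\geq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]
&\geq\Pr[|\mathrm{dist}_\varphi(x, y)^2-\mathrm{dist}_\pi(x, y)^2|\geq 6\varepsilon\cdot\mathrm{dist}_\varphi(x, y)^2] \\
&= \Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_i(x-y)\right|\geq 6\varepsilon(2-2K(x-y))\right] \\
&= \Pr\left[\left|\frac{1}{\sqrt{D}\cdot\sigma(x-y)}\sum_{i=1}^{D}X_i(x-y)\right|\geq 6\varepsilon(1-K(x-y))\cdot\frac{\sqrt{D}}{\sigma(x-y)}\right] \\
&\geq -O(D^{-1/2})+\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon(1-K(x-y))\sqrt{D}/\sigma(x-y)}^{\infty}e^{-s^2/2}\,ds \\
&= -O(D^{-1/2})+\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon\sqrt{D/s_K^{(\Delta,\rho)}}}^{\infty}e^{-s^2/2}\,ds,
\end{aligned}$$
where the second inequality is by Lemma E.1, and the last equality follows from the definition of $s_K(\cdot)$ and the choice of $x, y$ such that $s_K(x-y) = s_K^{(\Delta,\rho)}$.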
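To make the role of $s_K$ in Theorem 4.1 concrete, the following minimal sketch evaluates $s_K(x) = \frac{1+K(2x)-2K(x)^2}{2(1-K(x))^2}$ for one-dimensional inputs approaching the origin. The unit-bandwidth kernels $e^{-x^2/2}$ and $e^{-|x|}$ and the helper names are assumptions made for illustration. A short Taylor expansion suggests that $s_K(x)$ tends to $2$ for this Gaussian kernel while it grows like $1/|x|$ for this Laplacian kernel, which is consistent with the lower bound requiring a large $D$ for Laplacian kernels on nearby points.

```python
import math


def gaussian_k(x):
    # One-dimensional Gaussian kernel with unit bandwidth (an assumption).
    return math.exp(-0.5 * x * x)


def laplacian_k(x):
    # One-dimensional Laplacian kernel with unit bandwidth (an assumption).
    return math.exp(-abs(x))


def s_K(kernel, x):
    # s_K(x) = (1 + K(2x) - 2 K(x)^2) / (2 (1 - K(x))^2),
    # i.e. Var(cos<w, x> - K(x)) divided by (1 - K(x))^2.
    k1, k2 = kernel(x), kernel(2.0 * x)
    return (1.0 + k2 - 2.0 * k1 * k1) / (2.0 * (1.0 - k1) ** 2)


for x in (1.0, 0.1, 0.01):
    print(f"x={x:<5} gaussian={s_K(gaussian_k, x):10.3f} laplacian={s_K(laplacian_k, x):10.3f}")
```

Since the lower limit of the integral in eq. (2) is $6\varepsilon\sqrt{D/s_K^{(\Delta,\rho)}}$, the failure probability stays bounded away from zero unless $D$ is on the order of $s_K^{(\Delta,\rho)}/\varepsilon^2$.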

F BEYOND RFF: OBLIVIOUS EMBEDDING FOR LAPLACIAN KERNEL WITH SMALL COMPUTING TIME

In this section we show an oblivious feature mapping for Laplacian kernel dimension reduction with small computing time. The following is the main theorem.

Theorem F.1. Let $K$ be a Laplacian kernel with feature mapping $\varphi : \mathbb{R}^d\to\mathcal{H}$. For every $0 < \delta\leq\varepsilon\leq 2^{-16}$, every $d, D\in\mathbb{N}$ with $D\geq\max\{\Theta(\varepsilon^{-1}\log^3(1/\delta)),\ \Theta(\varepsilon^{-2}\log(1/\delta))\}$, and every $\Delta\geq\rho > 0$, there is a mapping $\pi : \mathbb{R}^d\to\mathbb{R}^D$ such that for every $x, y\in\mathbb{R}^d$ that are $(\Delta,\rho)$-bounded,
$$\Pr[|\mathrm{dist}_\pi(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]\geq 1-\delta.$$
The time for evaluating $\pi$ is $dD\,\mathrm{poly}(\log\frac{\Delta}{\rho},\log\delta^{-1})$.

For simplicity of exposition, we first handle the case when the input data are from $\mathbb{N}^d$. At the end we describe how to handle the case when the input data are from $\mathbb{R}^d$ by a simple transformation. Let $N\in\mathbb{N}$ be such that every entry of an input data point is at most $N$. Even though the input data consist only of integers, the mapping construction needs to handle fractions, as we will later consider numbers generated from Gaussian distributions or computed in the RFF mapping. So we first set up two numbers, $\Delta' = \mathrm{poly}(N,\delta^{-1})$ large enough and $\rho' = 1/\mathrm{poly}(N,\delta^{-1})$ small enough. All our following operations work on numbers that are $(\Delta',\rho')$-bounded. Denote $\rho'/\Delta'$ by $\rho_0$ for convenience.

F.1 EMBEDDING FROM $\ell_1$ TO $\ell_2^2$

Now we describe an isometric embedding from the $\ell_1$ norm to $\ell_2^2$. This construction is based on Kahane (1981), in which, to the best of our knowledge, the first such construction of finite dimension was given. Let $\pi_1 : \mathbb{N}\to\mathbb{R}^N$ be such that for every $x\in\mathbb{N}$, $x\leq N$, $\pi_1(x)[j] = 1$ if $j\in[1, x]$ and $\pi_1(x)[j] = 0$ otherwise. Let $\pi_1^{(d)} : \mathbb{N}^d\to\mathbb{R}^{Nd}$ be such that for every $x\in\mathbb{N}^d$ with $x_i\leq N$ for all $i\in[d]$, $\pi_1^{(d)}(x)$ is the concatenation of the $d$ vectors $\pi_1(x_i)$, $i\in[d]$.

Lemma F.1. For every $x, y\in\mathbb{N}^d$ with $x_i, y_i\leq N$, $i\in[d]$, it holds that
$$\|x-y\|_1 = \|\pi_1^{(d)}(x)-\pi_1^{(d)}(y)\|_2^2.$$

Proof. Notice that for every $i\in[d]$, $\pi_1(x_i)$ has its first $x_i$ entries being $1$ while $\pi_1(y_i)$ has its first $y_i$ entries being $1$. Thus $\|\pi_1(x_i)-\pi_1(y_i)\|_2^2$ is exactly $|x_i-y_i|$. If we consider all the $d$ dimensions, then the lemma follows by the construction of $\pi_1^{(d)}$.

Notice that we can apply the mapping $\pi_1^{(d)}$ and then the RFF mapping $\phi$ from Theorem 1.2 for the kernel function $\exp(-\|x-y\|_2^2)$, which is actually a Gaussian kernel. This gives a mapping which preserves the kernel distance for the Laplacian kernel. To be more precise, we set the mapping to be $\pi = \phi\circ\pi_1^{(d)}$. The only drawback is that the running time is high, as the above mapping maps $d$ dimensions to $dN$ dimensions. We formalize this as the following theorem.

Theorem F.2. Let $K$ be a Laplacian kernel with feature map $\varphi : \mathbb{R}^d\to\mathcal{H}$. For every $0 < \delta\leq\varepsilon\leq 2^{-16}$ and every $d, D, N\in\mathbb{N}$ with $D\geq\max\{\Theta(\varepsilon^{-1}\log^3(1/\delta)),\ \Theta(\varepsilon^{-2}\log(1/\delta))\}$, there exists a mapping $\pi : \mathbb{R}^d\to\mathbb{R}^D$ such that for every $x, y\in\mathbb{N}^d$ with $x, y\leq N$,
$$\Pr[|\mathrm{dist}_\pi(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\cdot\mathrm{dist}_\varphi(x, y)]\geq 1-\delta.$$
The time for evaluating $\pi$ is $\tilde{O}(dDN)$.

Proof. Consider the map $\pi$ defined above. Correctness follows from Lemma F.1 and Theorem 1.2. The running time is as stated, since we need to compute a vector of length $O(N)$ and then apply the RFF on this vector.

The time complexity is rather large, since we need to compute $\pi_1^{(d)}$, which has a rather large dimension $dN$. Next we show how to reduce the time complexity.
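The unary embedding of Lemma F.1 can be checked directly on small inputs. The sketch below is an illustration only; the function names `pi_1` and `pi_1_d` mirror the notation $\pi_1$ and $\pi_1^{(d)}$ above, and the sample vectors are arbitrary.

```python
def pi_1(x, N):
    # Unary encoding of an integer 0 <= x <= N: the first x entries are 1, the rest 0.
    return [1 if j <= x else 0 for j in range(1, N + 1)]


def pi_1_d(x, N):
    # Concatenate the unary encodings of all d coordinates (output length N * d).
    out = []
    for xi in x:
        out.extend(pi_1(xi, N))
    return out


def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_l2_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


N = 6
x, y = [3, 0, 5], [1, 4, 5]
lhs = l1_distance(x, y)
rhs = squared_l2_distance(pi_1_d(x, N), pi_1_d(y, N))
print(lhs, rhs)
```

The output dimension $Nd$ of this explicit encoding is exactly the cost that the alternate construction in the next subsection avoids.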

F.3 AN ALTERNATE CONSTRUCTION

We give the following map $\pi'$, which has the same output distribution as that of $\pi = \phi\circ\pi_1^{(d)}$. In the next subsection we use pseudorandom generators to replace the randomness in $\pi'$, while exploiting its high computational efficiency to reduce the time complexity.

Notice that in computing $\phi\circ\pi_1^{(d)}$, for each output dimension, the crucial step is computing $\langle\omega,\pi_1^{(d)}(x)\rangle$ for a Gaussian vector $\omega\in\mathbb{R}^{dN}$ whose coordinates are independent, each with Gaussian distribution $\omega_0$. The final output is a function of $\langle\omega,\pi_1^{(d)}(x)\rangle$. So we only need to present the construction for the first part, i.e., the inner product of an $N$-dimensional Gaussian vector and $\pi_1(x_1)$. For the other parts the computations are the same, and finally we only need to sum them up. Hence, to make the description simpler, we denote this inner product by $\langle\omega,\pi_1(x)\rangle$, where now $x\in\mathbb{N}$, $x\leq N$, and $\omega$ has $N$ coordinates, each being an independent $\omega_0$.

Let $h$ be the smallest integer such that $N\leq 2^h$. Consider a binary tree of depth $h$ in which each internal node has exactly $2$ children, so it has exactly $2^h\geq N$ leaf nodes in the last layer. For each node $v$, we attach a random variable $\alpha_v$ in the following way. For the root, we attach a Gaussian variable which is the summation of $2^h$ independent Gaussian variables with distribution $\omega_0$. Then we proceed layer by layer from the root to the leaves. For $u, v$ being the children of a common parent $w$, assume that $\alpha_w$ is the summation of $2^l$ independent $\omega_0$ distributions. Then let $\alpha_u$ be the summation of the first $2^{l-1}$ distributions among them and $\alpha_v$ be the summation of the second $2^{l-1}$ distributions. That is, $\alpha_w = \alpha_u+\alpha_v$ with $\alpha_u,\alpha_v$ being independent. Notice that conditioned on $\alpha_w = a$, $\alpha_u$ takes the value $b$ with probability $\Pr_{\alpha_u,\alpha_v\text{ i.i.d.}}[\alpha_u = b\mid\alpha_u+\alpha_v = a]$, and $\alpha_v$ then takes the value $a-b$. The randomness for generating the random variables corresponding to the nodes is presented as a sequence, in the order from the root to the leaves, layer by layer, from left to right.
We define $\alpha_x$ to be the summation of the random variables corresponding to the first $x$ leaves.

The running time is computed as follows. We only need to consider one dimension of the input data and one output dimension of the mapping, since the others can be computed in the same time. So we consider the time for sampling $\alpha_x$. For $\alpha_x$, recall that we visit the path from the root to the $x$-th leaf. We do not have to compute the whole output of $G$, but only need some parts of the output. For sampling each variable $\alpha_v$ along the path, we use $\tau$ bits of the output of $G$. By Theorem F.3, computing each random bit in $G$'s output, given the index of this bit, needs time $\mathrm{poly}(r)$. Locating the $\tau$ bits of randomness for generating $\alpha_v$ needs time $O(\log N)$. Generating each of the Gaussian random variables using these random bits needs time $t_\tau$. Summing up these variables takes less time than sampling all of them. After sampling, the cosine and sine functions of the RFF can be computed in time $\mathrm{poly}(1/\rho_0) = \mathrm{poly}(\log N,\delta^{-1})$. There are $d$ input dimensions and $D$ output dimensions. So the total time complexity is $dD\,\mathrm{poly}(\log N,\delta^{-1})$.

For the case that $x, y\in\mathbb{R}^d$, we only need to modify the embedding $\pi_1^{(d)}$ in the following way. We first round every entry so that its decimal part is finite. The rounded-off parts are small enough (e.g., dropping all digits after the $10\log\rho^{-1}$-th position to the right of the decimal point) that this only introduces small additive errors. Then we shift all the entries to be non-negative numbers by adding a common shift $s$. Then we multiply every entry of $x$ by a common factor $t$ such that every entry now has only an integer part. Notice that $t$ and $s$ can both be chosen according to $\frac{\Delta}{\rho}$, for example $t = s = O(\frac{\Delta}{\rho})$, and we can take $N$ to be $\mathrm{poly}(\frac{\Delta}{\rho})$. Then we apply $\pi_1$ and multiply by a factor $1/t$. Denote this map by $\tilde{\pi}_1$. Notice that this ensures that $\|x-y\|_1 = \|\tilde{\pi}_1^{(d)}(x)-\tilde{\pi}_1^{(d)}(y)\|_2^2$.
Then we can apply the same construction and analysis as in the above natural-number case. This proves the theorem.

G REMARKS AND COMPARISONS TO CHEN & PHILLIPS (2017)

Our upper bound in Theorem 1.2 is not directly comparable to that of Chen & Phillips (2017), which gave dimension reduction results for Gaussian kernels. Chen & Phillips (2017) showed in their Theorem 7 a slightly better target dimension bound than ours, but it only works for the case of $\|x-y\|\geq\sigma$, where $\sigma$ is the parameter in the Gaussian kernel¹. For the other case of $\|x-y\| < \sigma$, their Theorem 14 gave a related bound, but their guarantee is quite different from ours. Specifically, their target dimension depends linearly on the input dimension $d$. Hence, when $d$ is large (e.g., $d = \log^2 n$), their Theorem 14 is worse than ours (for the case of $\|x-y\| < \sigma$).

Finally, we remark that there might be subtle technical issues in the proof of Chen & Phillips (2017). Their Theorem 7 crucially uses a bound for moment generating functions that is established in their Lemma 5. However, we find various technical issues in the proof of Lemma 5 (found in their appendix). Specifically, the term $\mathbb{E}[e^{-s\frac{1}{2}\omega^2\|\Delta\|^2}]$ in the last line above "But" (on page 17) should actually be $\mathbb{E}[e^{s\frac{1}{2}\omega^2\|\Delta\|^2}]$. Even if one fixes this mistake (by negating the exponent), one can eventually only obtain a weaker bound of $\ln M(s)\leq\frac{s^2}{4}\|\Delta\|^4+s\|\Delta\|^2$ in the very last step, since the term $-s\|\Delta\|^2$ is negated accordingly. Hence, it is not clear whether the bound claimed in Theorem 7 can still be obtained.



¹ This condition is not clearly mentioned in the theorem statement, but it is indeed used, and it is mentioned one line above the statement in Chen & Phillips (2017).



3 in Theorem 3.1. In fact, we consider the more general k-clustering problem with ℓ p 2 -objective defined in the following Definition 3.1, which generalizes kernel k-means (by setting p = 2). Definition 3.1. Given a data set P ⊂ R d and kernel function K : R d × R d → R, denoting the feature mapping as φ : R d → H, the kernel k-clustering problem with ℓ p -objective asks for a k-partition C = {C 1 , C 2 , ..., C k } of P that minimizes the cost function: cost φ p (P, C) := k i=1 min ci∈H x∈Ci ∥φ(x) -

MAPPING FOR LAPLACIAN KERNEL Notice that we can apply the mapping π (d) 1 and the RFF mapping ϕ from Theorem 1.2 for the kernel function exp(-∥x-y∥ 2

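Explicitly, conditioned on $\alpha_w = a$, $\alpha_u$ takes the value $b$ with probability $\Pr_{\alpha_u,\alpha_v\text{ i.i.d.}}[\alpha_u = b\mid\alpha_u+\alpha_v = a]$, and $\alpha_v$ takes the value $a-b$ when $\alpha_u$ takes the value $b$.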

The third inequality is by plugging in the previous bound for $\Pr[|X_i|\geq\alpha\ell/r^2]$. The last inequality is by a calculation of the integral.

ACKNOWLEDGMENTS

Research is partially supported by a national key R&D program of China No. 2021YFA1000900, startup funds from Peking University, and the Advanced Institute of Information Technology, Peking University.

Appendices

A PROOF OF LEMMA 3.1

Lemma 3.1. If for some $r > 0$, $K$ is analytic in $\{x\in\mathbb{R}^d : \|x\|_1 < r\}$, then for every even $k\geq 1$ and every $x$ such that $\|x\|_1 < r$, we have $\mathbb{E}[X_i(x)^k]\leq\left(\frac{4k\|x\|_1}{r}\right)^{2k}$.

We first introduce the following two lemmas to show the properties of $\mathbb{E}[X_i(x)^k]$.

Lemma A.1. For any $\omega_i$ sampled in RFF and $k\geq 0$, we have $\mathbb{E}[\cos^k\langle\omega_i, x\rangle] = \frac{1}{2^k}\sum_{j=0}^{k}\binom{k}{j}K((2j-k)x)$.

Proof. By eq. (1) it is sufficient to prove that $\cos^k\langle\omega_i, x\rangle = \frac{1}{2^k}\sum_{j=0}^{k}\binom{k}{j}\cos((2j-k)\langle\omega_i, x\rangle)$.

Here $\alpha_x$ denotes the summation of the random variables corresponding to the first $x$ leaves. Notice that $\alpha_x$ can be sampled efficiently in the following way. Consider the path from the root to the $x$-th leaf. First we sample the root, which can be computed using the corresponding randomness. We use a variable $z$ to record this sample outcome, calling $z$ an accumulator for convenience. Then we visit each node along the path. When visiting $v$, assume its parent is $w$, where $\alpha_w$ has already been sampled previously with outcome $a$. If $v$ is a left child of $w$, then we sample $\alpha_v$ conditioned on $\alpha_w = a$. Assume this sampling has outcome $b$. Then we add $-a+b$ to the current accumulator $z$. If $v$ is a right child of a node $w$, then we keep the current accumulator $z$ unchanged. After visiting all nodes on the path, $z$ is the sample outcome for $\alpha_x$.

Lemma F.2. The joint distribution of $\alpha_x$, $x = 0, 1,\ldots,N$, is the same as that of $\langle\omega,\pi_1(x)\rangle$, $x = 0, 1,\ldots,N$.

Proof. According to our construction, each leaf is an independent distribution $\omega_0$. Hence if we take all the leaves and form a vector, then it has the same distribution as $\omega$. Notice that for each parent $w$ with two children $u, v$, by the construction, $\alpha_w = \alpha_u+\alpha_v$. Here $\alpha_u,\alpha_v$ are independent, each being a summation of $l$ independent $\omega_0$, with $l$ being the number of leaves derived from $u$. Thus for each layer, the variables $\alpha_u$ of the nodes $u$ in the layer are independent, and each parent is the summation of its children. So in the last layer all the variables are independent and follow the distribution $\omega_0$. And for each node $w$ in the tree, $\alpha_w$ is the summation of the random variables attached to the leaves of the subtree rooted at $w$. So $\alpha_x$ is the summation of the first $x$ leaf variables.

We do the same operation for the other dimensions of the output of $\pi_1$ and then sum them up to get the alternate construction $\pi'$.

We note that to generate an $\alpha_v$, we only need to simulate the conditional distributions. The distribution function $F$ of the random variable is easy to derive, since its density function is a product of three Gaussian density functions, i.e., $\Pr[\alpha_u = b\mid\alpha_u+\alpha_v = a] = \Pr[\alpha_u = b]\cdot\Pr[\alpha_v = a-b]/\Pr[\alpha_w = a]$, where $\alpha_u,\alpha_v$ are Gaussians. To compute $F$ we can use the Taylor expansion of its density function to get an analytical form of $F$, and the evaluation can then be computed in time $t_\tau = \mathrm{poly}(\rho_0^{-1})$. Recall that $\rho_0$ is defined to be $\rho'/\Delta'$. To sample $\alpha_u$, we use $\tau = O(\log\rho_0^{-1})$ uniform random bits to generate a number $p$ uniformly with precision $\mathrm{poly}(\rho_0^{-1})$ small enough. Then we use binary search to find a $b$ such that $F(b)\in[p-\varepsilon_0, p+\varepsilon_0]$, for some small enough $\varepsilon_0 = \mathrm{poly}(\rho_0)$. The space used is $s_\tau = \mathrm{poly}\log(\rho_0^{-1})$.

We remark that simulating a distribution using uniform random bits always has some simulation bias. The above lemma is proved under the assumption that the simulation has no bias. But we can see that the statistical distance between the simulated distribution and the original distribution is at most $\mathrm{poly}(\rho_0) = 1/\mathrm{poly}(N)$, which is small enough by our choice of $\Delta' = \mathrm{poly}(N,\delta^{-1})$, $\rho' = 1/\mathrm{poly}(N,\delta^{-1})$. So if we consider the simulation bias, then we can show that for every subset $S\subseteq\{0, 1,\ldots,N\}$, the joint distribution of $\alpha_x$, $x\in S$, has statistical distance $O(|S|\varepsilon_0)$ to the joint distribution of $\langle\omega,\pi_1(x)\rangle$, $x\in S$. Later we will only use the case $|S| = 2$, i.e., two points. So the overall statistical distance is $\delta^{\Theta(1)}$, which does not affect our analysis and parameters.
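The path-walking sampler for $\alpha_x$ described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the construction analyzed in the proofs: each node draws from its own generator seeded by a hypothetical `(seed, depth, index)` key instead of reading a shared layer-by-layer random tape, and the conditional law of a left child is sampled directly as a Gaussian (given $\alpha_u+\alpha_v = a$ with i.i.d. zero-mean Gaussian halves of variance $m\sigma^2$, $\alpha_u$ is distributed as $N(a/2, m\sigma^2/2)$) rather than by binary search on the distribution function $F$.

```python
import math
import random


def node_rng(seed, depth, index):
    # Hypothetical per-node randomness: a generator keyed by the node's position,
    # so that every query path sees the same value at a shared node.
    return random.Random(f"{seed}:{depth}:{index}")


def prefix_sum(x, h, seed, sigma=1.0):
    """Sample alpha_x, the sum of the first x leaves of a depth-h tree (0 <= x <= 2**h)."""
    if x == 0:
        return 0.0
    size = 2 ** h                        # number of leaves below the current node
    a = node_rng(seed, 0, 0).gauss(0.0, sigma * math.sqrt(size))  # root value
    z = a                                # accumulator from the text
    depth, index, left_of_subtree = 0, 0, 0
    while size > 1:
        half = size // 2
        # Left child conditioned on the parent value a: N(a / 2, half * sigma^2 / 2).
        b = node_rng(seed, depth + 1, 2 * index).gauss(a / 2.0, sigma * math.sqrt(half / 2.0))
        if x <= left_of_subtree + half:
            z += b - a                   # moving to the left child: add -a + b
            a, index = b, 2 * index
        else:
            a, index = a - b, 2 * index + 1   # right child: accumulator unchanged
            left_of_subtree += half
        depth, size = depth + 1, half
    return z


h, seed = 3, 7
alphas = [prefix_sum(x, h, seed) for x in range(2 ** h + 1)]
leaves = [alphas[j] - alphas[j - 1] for j in range(1, 2 ** h + 1)]
print(alphas)
```

Each call touches only the $h+1$ nodes on one root-to-leaf path, which is the source of the $O(\log N)$ factor in place of $N$; different queries agree on shared nodes because the per-node randomness is a deterministic function of the node.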

F.4 REDUCING THE TIME COMPLEXITY USING PRGS

Next we use a pseudorandom generator to replace the randomness used in the above construction. A function $G : \{0, 1\}^r\to\{0, 1\}^n$ is a pseudorandom generator for space-$s$ computations with error parameter $\varepsilon_g$ if for every probabilistic TM $M$ with space $s$ using $n$ bits of randomness in the read-once manner,
$$\left|\Pr[M(G(U_r)) = 1]-\Pr[M(U_n) = 1]\right|\leq\varepsilon_g.$$
Here $r$ is called the seed length of $G$.

Theorem F.3 (Nisan 1992). For every $n\in\mathbb{N}$ and $s\in\mathbb{N}$, there exists a pseudorandom generator $G : \{0, 1\}^r\to\{0, 1\}^n$ for space-$s$ computations with parameter $\varepsilon_g$, where $r = O(\log n(\log n+s+\log\frac{1}{\varepsilon_g}))$. $G$ can be computed in polynomial time (in $n, r$) and $O(r)$ space. Moreover, given an index $i\in[n]$, the $i$-th bit of the output of $G$ can be computed in time $\mathrm{poly}(r)$.

Let $G : \{0, 1\}^r\to\{0, 1\}^\ell$, $\ell = 2dDN\tau$, be a pseudorandom generator for space $s = c_1(\log N+s_\tau)$, with $\varepsilon_g = \delta/2$ and $\tau = c_2\log N$ for some large enough constants $c_1, c_2$. Again we only need to consider the construction corresponding to the first output dimension of $\phi\circ\pi_1$. We replace the randomness $U_\ell$ used in the construction by the output of $G$. That is, when we need $\tau$ uniform random bits to construct a distribution $\alpha_v$ in the tree, we first compute the positions of these bits in $U_\ell$ and then compute the corresponding bits in the output of $G$. Then we use them to do the construction in the same way. We denote this mapping using pseudorandomness as our final mapping $\pi^*$.

Now we provide a test algorithm to show that the feature mapping provided by the pseudorandom distribution has roughly the same quality as that of the mapping provided by true randomness. We denote the test algorithm by $T = T_{K,x,y,\varepsilon}$, where $x, y\in\mathbb{R}^d$ and $K$ is a Laplacian kernel with feature mapping $\varphi$. $T$ works as follows. Its input is the randomness, being either $U_\ell$ or $G(U_r)$. $T$ first computes $\mathrm{dist}_\varphi(x, y)$. Notice that $T$ does not actually have to compute $\varphi$, since the distance can be directly computed as $\sqrt{2-2K(x, y)}$. Then $T(G(U_r))$ computes $\mathrm{dist}_{\pi^*}(x, y)$ and tests whether
$$|\mathrm{dist}_{\pi^*}(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\,\mathrm{dist}_\varphi(x, y).$$
Notice that when the input is $U_\ell$, this algorithm $T$ is instead testing whether
$$|\mathrm{dist}_{\pi'}(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\,\mathrm{dist}_\varphi(x, y).$$
Recall that $\pi'$ is defined in the previous section as our mapping using true randomness.

Next we consider using $T$ on true randomness.

Lemma F.3. $\Pr[T(U_\ell) = 1]\geq 1-\delta/2$.

Proof. By Lemma F.2, $\mathrm{dist}_{\pi'}(x, y) = \mathrm{dist}_\pi(x, y)$. By Theorem 1.2, setting the error probability to be $\delta/2$, we have
$$\Pr[|\mathrm{dist}_\pi(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\,\mathrm{dist}_\varphi(x, y)]\geq 1-\delta/2.$$
Notice that the event $T(U_\ell) = 1$ is indeed $|\mathrm{dist}_{\pi'}(x, y)-\mathrm{dist}_\varphi(x, y)|\leq\varepsilon\,\mathrm{dist}_\varphi(x, y)$. Hence the lemma holds.

Now we show that $T$ is actually a small-space computation.

Lemma F.4. $T$ runs in space $c(\log N+s_\tau)$ for some constant $c$, and the input is read once.

Proof. The computation of $\mathrm{dist}_\varphi(x, y)$ is in space $O(\log N)$, since $x, y\in\mathbb{N}$, $x, y\leq N$ and the kernel function $K$ can be computed in that space. Now we focus on the computation of $\pi'$. We claim that by the construction of $\alpha_x$ in Section F.3, $\pi'$ can be computed using space $O(s_\tau+\log N)$. The procedure proceeds as follows. First it finds the path to the $x$-th leaf. This takes space $O(\log N)$. Then along this path, for each node we need to compute a distribution $\alpha_v$. This takes space $O(s_\tau)$. Also notice that since the randomness is presented layer by layer, the procedure only needs to do a read-once sweep of the randomness. $T$ needs to compute $\pi'$ for both $x$ and $y$, but this only blows up the space by a factor of $2$. So the overall space needed is as stated.

Finally we prove our theorem by using the property of the PRG.

Proof of Theorem F.1. We first show our result assuming $x, y\in\mathbb{N}$, $x, y\leq N$ for an integer $N$. We claim that $\pi^*$ is the mapping we want. By Lemma F.

