ON THE RELATIVE ERROR OF RANDOM FOURIER FEATURES FOR PRESERVING KERNEL DISTANCE

Abstract

The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, its ability to preserve the kernel distance with relative error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing that, as long as the shift-invariant kernel is analytic, RFF with poly(ε⁻¹ log n) dimensions achieves ε-relative error for the pairwise kernel distances of n points, and the dimension bound improves to poly(ε⁻¹ log k) for the specific application of kernel k-means. Finally, going beyond RFF, we make a first step towards data-oblivious dimension reduction for general shift-invariant kernels, and we obtain a similar poly(ε⁻¹ log n) dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets; they demonstrate superior performance compared with other popular methods, including random projection and the Nyström method.

1. INTRODUCTION

We study the ability of the random Fourier features (RFF) method (Rahimi & Recht, 2007) to preserve the kernel distance with relative error. The kernel method (Schölkopf & Smola, 2002) is a systematic way to map the input data into a (possibly infinite-dimensional) feature space to introduce richer structures, such as non-linearity. In particular, for a set of n data points P, a kernel function K : P × P → R implicitly defines a feature mapping φ : P → H to a feature space H which is a Hilbert space, such that K(x, y) = ⟨φ(x), φ(y)⟩ for all x, y. Kernel methods have been successfully applied to classical machine learning (Boser et al., 1992; Schölkopf et al., 1998; Girolami, 2002), and it has recently been established that, in a certain sense, the behavior of neural networks may be modeled as a kernel (Jacot et al., 2018). Despite their superior power and wide applicability, scalability has been an outstanding issue in applying kernel methods. Specifically, the representation of data points in the feature space is only implicit, and solving for an explicit representation, which is crucially required in many algorithms, takes at least Ω(n²) time in the worst case. While for many problems such as kernel SVM it is possible to apply the so-called "kernel trick" to rewrite the objective in terms of K(x, y), an explicit representation is still often preferred, since it is compatible with a larger range of solvers/algorithms, which allows better efficiency. In a seminal work (Rahimi & Recht, 2007), Rahimi and Recht addressed this issue by introducing the method of random Fourier features (see Section 2 for a detailed description), which computes an explicit low-dimensional mapping φ′ : P → R^D (for D ≪ n) such that ⟨φ′(x), φ′(y)⟩ ≈ ⟨φ(x), φ(y)⟩ = K(x, y), for shift-invariant kernels (i.e., kernels for which there exists a function k such that K(x, y) = k(x − y)), a class that includes the widely-used Gaussian, Cauchy, and Laplacian kernels.
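As a concrete illustration of the RFF construction described above, the following is a minimal sketch for the Gaussian kernel (a standard instantiation, not specific to this paper; the function name and the bandwidth parameter `sigma` are ours):

```python
import numpy as np

def rff_features(X, D, sigma=1.0, seed=None):
    """Random Fourier features for the Gaussian kernel
    K(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Maps the n rows of X to an (n, D) matrix Z with Z @ Z.T
    approximating the kernel matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # The Fourier transform of the Gaussian kernel is again Gaussian,
    # so frequencies are drawn from N(0, sigma^{-2} I).
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(0).normal(size=(5, 3))
Z = rff_features(X, D=4096, seed=1)
approx = Z @ Z.T  # approximates K(x_i, x_j) entrywise
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
exact = np.exp(-0.5 * sq)
max_err = np.max(np.abs(approx - exact))
```

With D = 4096, the maximum additive error over all pairs is on the order of 1/√D, consistent with the additive-error guarantees discussed next.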
Towards understanding this fundamental method of RFF, a long line of research has focused on analyzing the tradeoff between the target dimension D and the accuracy of approximating K under certain error measures. This includes the additive error max_{x,y} |⟨φ(x), φ(y)⟩ − ⟨φ′(x), φ′(y)⟩| (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015), spectral error (Avron et al., 2017; Choromanski et al., 2018; Zhang et al., 2019; Erdélyi et al., 2020; Ahle et al., 2020), and the generalization error of several learning tasks such as kernel SVM and kernel ridge regression (Avron et al., 2017; Sun et al., 2018; Li et al., 2021). A more comprehensive overview of the study of RFF can be found in a recent survey (Liu et al., 2021). We focus on analyzing RFF with respect to the kernel distance. Here, the kernel distance of two data points x, y is defined as their (Euclidean) distance in the feature space, i.e., dist_φ(x, y) = ∥φ(x) − φ(y)∥₂. While previous results on the additive error of K(x, y) (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015; Avron et al., 2017) readily imply an additive error guarantee for dist_φ(x, y), the relative error guarantee is less understood. As far as we know, Chen & Phillips (2017) is the only previous work that gives a relative error bound for the kernel distance, but unfortunately only the Gaussian kernel is studied in that work, and whether or not the kernel distance for other shift-invariant kernels is preserved by RFF is still largely open. In spirit, this multiplicative error guarantee of RFF, if it indeed exists, would make it a kernelized version of the Johnson-Lindenstrauss Lemma (Johnson & Lindenstrauss, 1984), which is one of the central results in dimension reduction.
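Since a normalized shift-invariant kernel satisfies K(x, x) = 1, the kernel distance can be evaluated exactly via the kernel trick as dist_φ(x, y)² = K(x, x) + K(y, y) − 2K(x, y) = 2 − 2K(x, y), which makes the relative error of RFF easy to check empirically. A minimal sketch for the Gaussian kernel (the setting analyzed by Chen & Phillips (2017); variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 4096
W = rng.normal(size=(d, D))              # Gaussian kernel with sigma = 1
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
z = lambda x: np.sqrt(2.0 / D) * np.cos(x @ W + b)

x, y = np.zeros(d), np.ones(d)
# Exact kernel distance via the kernel trick:
# dist_phi(x, y) = sqrt(2 - 2 K(x, y)) = sqrt(2 - 2 exp(-||x - y||^2 / 2)).
exact = np.sqrt(2 - 2 * np.exp(-0.5 * np.sum((x - y) ** 2)))
approx = np.linalg.norm(z(x) - z(y))     # distance between RFF embeddings
rel_err = abs(approx - exact) / exact
```

For the Gaussian kernel, `rel_err` is small at this moderate distance; the question studied in this paper is for which kernels such a multiplicative guarantee holds uniformly with low target dimension.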
This guarantee is also very useful for downstream applications, since one can combine it directly with classical geometric algorithms such as k-means++ (Arthur & Vassilvitskii, 2007), locality-sensitive hashing (Indyk & Motwani, 1998), and fast geometric matching algorithms (Raghvendra & Agarwal, 2020) to obtain very efficient algorithms for kernelized k-means clustering, nearest-neighbor search, matching, and more.

1.1. OUR CONTRIBUTIONS

Our main results are characterizations of the kernel functions for which RFF preserves the kernel distance with small relative error using polylogarithmic target dimension. Furthermore, we also explore how to obtain data-oblivious dimension reduction for kernels that cannot be handled by RFF. As mentioned, it has been shown that RFF with small dimension preserves the kernel distance with additive error for all shift-invariant kernels (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015). In addition, it has been shown in Chen & Phillips (2017) that RFF indeed preserves the relative error of the kernel distance for Gaussian kernels (which are shift-invariant). Hence, by analogy to the additive case, and as informally claimed in Chen & Phillips (2017), one might be tempted to expect that RFF preserves the relative error for general shift-invariant kernels as well.

Lower Bounds. Surprisingly, we show that this is not the case. In particular, we show that for a wide range of kernels, including the well-known Laplacian kernels, RFF requires unbounded target dimension to preserve the kernel distance with constant multiplicative error. We state the result for a Laplacian kernel below; the full statement of the general conditions on kernels can be found in Theorem 4.1. In fact, what we show is a quantitatively stronger result: if the input is (∆, ρ)-bounded, then preserving any constant multiplicative error requires Ω(poly(∆/ρ)) target dimension. Here, a point x ∈ R^d is (∆, ρ)-bounded if ∥x∥_∞ ≤ ∆ and min_{i : x_i ≠ 0} |x_i| ≥ ρ, i.e., the magnitude is (upper) bounded by ∆ and the resolution is (lower) bounded by ρ.

Theorem 1.1 (Lower bound; see Remark 4.1). For every ∆ ≥ ρ > 0 and some feature mapping φ : R^d → H of a Laplacian kernel K(x, y) = exp(−∥x − y∥₁), if for every x, y ∈ R^d that are (∆, ρ)-bounded, the RFF mapping π for K with target dimension D satisfies dist_π(x, y) ∈ (1 ± ε) · dist_φ(x, y) with constant probability, then D ≥ Ω(ε⁻² · ∆/ρ). This holds even when d = 1.

Upper Bounds. Complementing the lower bound, we show that RFF can indeed preserve the kernel distance within 1 ± ε error using poly(ε⁻¹ log n) target dimensions with high probability,


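The ∆/ρ dependence in the lower bound can already be observed empirically in d = 1 (an illustrative experiment, not the paper's proof; all names are ours). For the Laplacian kernel, RFF frequencies are Cauchy-distributed, and a back-of-the-envelope variance calculation suggests the relative error at a pair of distance ρ scales like 1/√(ρD), blowing up as ρ → 0 for fixed D:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096

def laplacian_rff_rel_err(rho, trials=50):
    """Mean relative error of the RFF estimate of dist_phi(0, rho)
    for the 1-d Laplacian kernel K(x, y) = exp(-|x - y|)."""
    exact = np.sqrt(2 - 2 * np.exp(-rho))
    errs = []
    for _ in range(trials):
        # The Fourier transform of exp(-|t|) is the Cauchy density,
        # so frequencies are standard-Cauchy samples.
        w = rng.standard_cauchy(D)
        b = rng.uniform(0.0, 2.0 * np.pi, D)
        z = lambda t: np.sqrt(2.0 / D) * np.cos(w * t + b)
        approx = np.linalg.norm(z(0.0) - z(rho))
        errs.append(abs(approx - exact) / exact)
    return float(np.mean(errs))

err_moderate = laplacian_rff_rel_err(1.0)   # comfortably small
err_tiny = laplacian_rff_rel_err(1e-4)      # degrades as rho -> 0
```

At fixed D, the relative error at resolution ρ = 10⁻⁴ is far larger than at distance 1, consistent with the D ≥ Ω(ε⁻² · ∆/ρ) requirement of Theorem 1.1.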