ON THE RELATIVE ERROR OF RANDOM FOURIER FEATURES FOR PRESERVING KERNEL DISTANCE

Abstract

The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique for finding approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, its ability to preserve the kernel distance with relative error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing that as long as the shift-invariant kernel is analytic, RFF with poly(ε⁻¹ log n) dimensions achieves ε-relative error for the pairwise kernel distances of n points, and the dimension bound improves to poly(ε⁻¹ log k) for the specific application of kernel k-means. Finally, going beyond RFF, we take a first step towards data-oblivious dimension reduction for general shift-invariant kernels, and we obtain a similar poly(ε⁻¹ log n) dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets, where they demonstrate superior performance compared with other popular methods, including random-projection and Nyström methods.

1. INTRODUCTION

We study the ability of the random Fourier features (RFF) method (Rahimi & Recht, 2007) to preserve the kernel distance with relative error. The kernel method (Schölkopf & Smola, 2002) is a systematic way to map the input data into a (possibly infinite-dimensional) feature space in order to introduce richer structure, such as non-linearity. In particular, for a set of n data points P, a kernel function K : P × P → R implicitly defines a feature mapping φ : P → H to a feature space H which is a Hilbert space, such that ∀x, y, K(x, y) = ⟨φ(x), φ(y)⟩. Kernel methods have been successfully applied to classical machine learning (Boser et al., 1992; Schölkopf et al., 1998; Girolami, 2002), and it has recently been established that, in a certain sense, the behavior of neural networks may be modeled as a kernel (Jacot et al., 2018). Despite their superior power and wide applicability, scalability has been an outstanding issue in applying kernel methods. Specifically, the representation of data points in the feature space is only implicit, and solving for the explicit representation, which is crucially required by many algorithms, takes at least Ω(n²) time in the worst case. While for many problems, such as kernel SVM, it is possible to apply the so-called "kernel trick" to rewrite the objective in terms of K(x, y), the explicit representation is still often preferred, since it is compatible with a larger range of solvers/algorithms, which allows better efficiency. In a seminal work (Rahimi & Recht, 2007), Rahimi and Recht addressed this issue by introducing the method of random Fourier features (see Section 2 for a detailed description), which computes an explicit low-dimensional mapping φ′ : P → R^D (for D ≪ n) such that ⟨φ′(x), φ′(y)⟩ ≈ ⟨φ(x), φ(y)⟩ =

