SUBQUADRATIC ALGORITHMS FOR KERNEL MATRICES VIA KERNEL DENSITY ESTIMATION

Abstract

Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics, and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency: given n input points, most kernel-based algorithms need to materialize the full n × n kernel matrix before performing any subsequent computation, thus incurring Ω(n^2) runtime. Breaking this quadratic barrier for various problems has therefore been a subject of extensive research efforts. We break the quadratic barrier and obtain subquadratic time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation, and counting weighted triangles. We build on the recently developed Kernel Density Estimation framework, which (after preprocessing in time subquadratic in n) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from weighted vertex and weighted edge sampling on kernel graphs, simulating random walks on kernel graphs, and importance sampling on matrices to Kernel Density Estimation, and we show that we can generate samples from these distributions in time sublinear in the support of the distribution. Our reductions are the central ingredient in each of our applications, and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a 9x decrease in the number of kernel evaluations over baselines for LRA and a 41x reduction in the graph size for spectral sparsification.

1. Introduction

For a kernel function k : R^d × R^d → R and a set X = {x_1, ..., x_n} ⊂ R^d of n points, the entries of the n × n kernel matrix K are defined as K_{i,j} = k(x_i, x_j). Alternatively, one can view X as the vertex set of a complete weighted graph where the weights between points are defined by the kernel matrix K. Popular choices of the kernel function k include the Gaussian kernel, the Laplace kernel, the exponential kernel, etc.; see (Schölkopf et al., 2002; Shawe-Taylor et al., 2004; Hofmann et al., 2008) for a comprehensive overview.

Despite their wide applicability, kernel methods suffer from drawbacks, one of the main ones being efficiency: given n input points in d dimensions, many kernel-based algorithms need to materialize the full n × n kernel matrix K before performing the computation. For some problems this is unavoidable, especially if high-precision results are required (Backurs et al., 2017). In this work, we show that we can in fact break this Ω(n^2) barrier for several fundamental problems in numerical linear algebra and graph processing. We obtain algorithms that run in o(n^2) time and scale inversely proportionally to the smallest entry of the kernel matrix. This allows us to skirt several known lower bounds, where the hard instances require the smallest kernel entry to be polynomially small in n. Our parameterization in terms of the smallest entry is motivated by the fact that, in practice, the smallest kernel value is often a fixed constant (March et al., 2015; Siminelakis et al., 2019; Backurs et al., 2019; 2021; Karppa et al., 2022).

We build on recently developed fast approximate algorithms for Kernel Density Estimation (Charikar & Siminelakis, 2017; Backurs et al., 2018; Siminelakis et al., 2019; Backurs et al., 2019; Charikar et al., 2020). Specifically, these papers present fast approximate data structures with the following functionality.

Definition 1.1 (Kernel Density Estimation (KDE) Queries).
For a given dataset X ⊂ R^d of size n, kernel function k, and precision parameter ε > 0, a KDE data structure supports the following operation: given a query y ∈ R^d, return a value KDE_X(y) that lies in the interval [(1 - ε)z, (1 + ε)z], where z = Σ_{x ∈ X} k(x, y), assuming that k(x, y) ≥ τ for all x ∈ X.

The performance of the state-of-the-art algorithms for KDE also scales proportionally to the smallest kernel value of the dataset (see Table 1). In short, after a preprocessing time that is subquadratic in n, KDE data structures use time sublinear in n to answer queries defined as above. Note that for all of our kernels, k(x, y) ≤ 1 for all inputs x, y.

Table 1: Instantiations of KDE queries. The query times depend on the dimension d, accuracy ε, and lower bound τ. The parameter β is assumed to be a constant.

Kernel             | k(x, y)                     | Preprocessing time        | Query time               | Reference
Gaussian           | e^{-||x-y||_2^2}            | nd / (ε^2 τ^{0.173+o(1)}) | d / (ε^2 τ^{0.173+o(1)}) | (Charikar et al., 2020)
Exponential        | e^{-||x-y||_2}              | nd / (ε^2 τ^{0.1+o(1)})   | d / (ε^2 τ^{0.1+o(1)})   | (Charikar et al., 2020)
Laplacian          | e^{-||x-y||_1}              | nd / (ε^2 τ^{0.5})        | d / (ε^2 τ^{0.5})        | (Backurs et al., 2019)
Rational Quadratic | 1 / (1 + ||x-y||_2^2)^β     | nd / ε^2                  | d / ε^2                  | (Backurs et al., 2018)
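To make the query interface of Definition 1.1 concrete, the following is a minimal brute-force sketch in Python (our own illustration, not one of the cited data structures): it answers each query exactly, so ε = 0, but spends O(nd) time per query, whereas the data structures of Table 1 answer the same query approximately in time sublinear in n. The class name `ExactKDE` and the choice of the Gaussian kernel are ours.

```python
import numpy as np

class ExactKDE:
    """Brute-force reference for the KDE query interface of Definition 1.1:
    given a query y, return z = sum_{x in X} k(x, y) exactly (so eps = 0).
    The cited data structures return a (1 +/- eps)-approximation of z in
    time sublinear in n after subquadratic preprocessing."""

    def __init__(self, X, kernel):
        self.X = X          # the dataset, one point per row
        self.kernel = kernel

    def query(self, y):
        # O(nd) scan over the dataset; this is the cost the fast
        # approximate data structures avoid.
        return float(sum(self.kernel(x, y) for x in self.X))

# Gaussian kernel k(x, y) = exp(-||x - y||_2^2); values lie in (0, 1].
gauss = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
kde = ExactKDE(X, gauss)
z = kde.query(X[0])
# The query point x_0 is in X, so k(x_0, x_0) = 1 contributes to z,
# and every term is at most 1, hence 1 <= z <= n.
assert 1.0 <= z <= 100.0
```

Any of our downstream algorithms only needs black-box access to `query`, so swapping this stand-in for a fast approximate KDE data structure leaves the reductions unchanged.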

1.1. Our Results

We show that given a KDE data structure as described above, it is possible to solve a variety of matrix and graph problems in subquadratic time o(n^2), i.e., sublinear in the matrix size. We emphasize that in our applications, we only require black-box access to KDE queries. Given this, we design algorithms for problems such as eigenvalue/eigenvector estimation, low-rank approximation, graph sparsification, local clustering, arboricity estimation, and estimating the total weight of triangles.

Our results are obtained via the following two-pronged approach. First, we use KDE data structures to design algorithms for four basic primitives, frequently used in sublinear-time algorithms and property testing, which we enumerate below. In the second step, we use these primitives to implement a host of algorithms for the aforementioned problems. We emphasize that these primitives are used in a black-box manner, meaning that any further improvements to their running times will automatically translate into improved algorithms for the downstream problems.

For our applications, we make the following parameterization, which we expand upon in Remark B.1 and Section B.1. At a high level, many of our applications, such as spectral sparsification, are succinctly characterized by it.

Parameterization 1.1. All of our algorithms are parameterized by the smallest entry of the kernel matrix, i.e., the smallest edge weight in the graph represented by K is at least τ.
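As an illustration of Parameterization 1.1 (a sketch under the assumption of the Gaussian kernel; the helper name is ours): for k(x, y) = exp(-||x - y||_2^2), a bound on the dataset's diameter immediately yields a lower bound τ on every entry of K, using a single O(nd)-time pass rather than examining all n^2 pairs.

```python
import numpy as np

def gaussian_tau_lower_bound(radius):
    """For the Gaussian kernel k(x, y) = exp(-||x - y||_2^2), a dataset
    contained in a ball of the given radius has diameter at most
    2 * radius, so every kernel entry is at least exp(-(2 * radius)^2).
    This certifies the parameter tau of Parameterization 1.1."""
    return float(np.exp(-(2.0 * radius) ** 2))

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))

# Radius of a ball around the centroid containing all points: one pass.
radius = float(np.max(np.linalg.norm(X - X.mean(axis=0), axis=1)))
tau = gaussian_tau_lower_bound(radius)
assert 0.0 < tau <= 1.0

# Spot-check the certified bound on a few random pairs.
for i, j in rng.integers(0, 200, size=(20, 2)):
    assert np.exp(-np.sum((X[i] - X[j]) ** 2)) >= tau - 1e-12
```

In practice tau is often a fixed constant, as noted above, which is exactly the regime where the running times in Table 1 are sublinear per query.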



1. Sampling vertices by their (weighted) degree in K (Theorems C.2 and C.4 and Algorithms 2 / 4).
2. Sampling random neighbors of a given vertex by edge weights in K, and sampling a random weighted edge (Theorem C.5 and Algorithms 5 and 6).
3. Performing random walks in the graph K (Theorem C.7 and Algorithm 7).
4. Sampling the rows of the edge-vertex incidence matrix and the kernel matrix K, both with probability proportional to their squared row norms (Section D.1 and Theorem D.1, and Section D.2 and Corollary D.10, respectively).
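The first primitive above can be sketched as follows (an illustration under our own naming; the function and the exact KDE stand-in are hypothetical): sample a vertex i with probability proportional to its weighted degree d_i = Σ_j k(x_i, x_j), where each degree is obtained through a black-box KDE query. This naive version issues n queries and is only meant to show the reduction's interface; the sublinear-time versions are Algorithms 2 / 4.

```python
import numpy as np

def sample_vertex_by_degree(X, kde_query, rng):
    """Sample a vertex i of the kernel graph with probability
    proportional to its weighted degree d_i = sum_j k(x_i, x_j),
    using only black-box KDE queries (primitive 1 above).
    Naive sketch: n queries; the paper's algorithms are faster."""
    degrees = np.array([kde_query(x) for x in X])  # d_i via KDE queries
    probs = degrees / degrees.sum()                # normalize to a distribution
    return rng.choice(len(X), p=probs), probs

# Exact KDE oracle standing in for a fast approximate data structure.
gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2))
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 2))
kde_query = lambda y: sum(gauss(x, y) for x in X)

i, probs = sample_vertex_by_degree(X, kde_query, rng)
assert 0 <= i < 50
assert np.isclose(probs.sum(), 1.0)
```

Because the oracle is only accessed through `kde_query`, an approximate KDE data structure changes the sampling probabilities by at most a (1 ± ε) factor per degree, which is the kind of distortion the reductions are designed to tolerate.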


