FASTER BINARY EMBEDDINGS FOR PRESERVING EUCLIDEAN DISTANCES

Abstract

We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset T ⊆ R^n into binary sequences in the cube {±1}^m. When T consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to Ax, where A ∈ R^{m×n} is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use x ↦ sign(Ax) for the embedding. Moreover, we show that Euclidean distances among the elements of T are approximated by the ℓ1 norm on the images of {±1}^m under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity O(m) and space complexity O(m) on well-spread data. When the data is not well-spread, we show that the approach still works provided that the data is transformed via a Walsh-Hadamard matrix, but now the cost is O(n log n) per data point. Further, we prove that the method is accurate and its associated error is comparable to that of a continuous-valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension m increases. Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance.

1. INTRODUCTION

Analyzing large sets of high-dimensional raw data is usually computationally demanding and memory intensive. As a result, it is often necessary as a preprocessing step to transform the data into a lower-dimensional space while approximately preserving important geometric properties, such as pairwise ℓ2 distances. As a critical result in dimensionality reduction, the Johnson-Lindenstrauss (JL) lemma (Johnson & Lindenstrauss, 1984) guarantees that every finite set T ⊆ R^n can be (linearly) mapped into an m = O(ε^{-2} log(|T|))-dimensional space in such a way that all pairwise distances are preserved up to a multiplicative (1 ± ε) distortion. Additionally, there are many significant results that speed up the JL transform by introducing fast embeddings, e.g., (Ailon & Chazelle, 2009; Ailon & Liberty, 2013; Krahmer & Ward, 2011; Nelson et al., 2014), or by using sparse matrices (Kane & Nelson, 2014; 2010; Clarkson & Woodruff, 2017). Such fast embeddings can usually be computed in O(n log n) time versus the O(mn) time complexity of JL transforms that rely on unstructured dense matrices.

1.1. RELATED WORK

To further reduce memory requirements, progress has been made in nonlinearly embedding high-dimensional sets T ⊆ R^n into the binary cube {-1, 1}^m with m ≪ n, a process known as binary embedding. Provided that d1(·,·) is a metric on R^n, a distance-preserving binary embedding is a map f : T → {-1, 1}^m together with a function d2(·,·) on {-1, 1}^m × {-1, 1}^m that approximates distances, i.e.,

|d2(f(x), f(y)) - d1(x, y)| ≤ α, for all x, y ∈ T. (1)

The potential dimensionality reduction (m ≪ n) and 1-bit representation per dimension imply that storage space can be considerably reduced, and downstream applications like learning and retrieval can operate directly via bitwise operations. Most existing nonlinear mappings f in (1) are generated using simple memory-less scalar quantization (MSQ). For example, given a set of unit vectors T ⊆ S^{n-1} with finite size |T|, consider the map

q_x := f(x) = sign(Gx) (2)

where G ∈ R^{m×n} is a standard Gaussian random matrix and sign(·) returns the element-wise sign of its argument. Let d1(x, y) = (1/π) arccos(⟨x, y⟩ / (‖x‖2 ‖y‖2)) be the normalized angular distance and d2(q_x, q_y) = (1/(2m)) ‖q_x - q_y‖1 be the normalized Hamming distance. Then, Yi et al. (2015) show that (1) holds with probability at least 1 - η if m ≳ α^{-2} log(|T|/η), so one can approximate geodesic distances with normalized Hamming distances. While this approach achieves optimal bit complexity (up to constants) (Yi et al., 2015), it has been observed in practice that m usually needs to be around O(n) to guarantee reasonable accuracy (Gong et al., 2013; Sánchez & Perronnin, 2011; Yu et al., 2014). Much like linear JL embedding techniques admit fast counterparts, fast binary embedding algorithms have been developed to significantly reduce the runtime of binary embeddings (Gong et al., 2012b; Liu et al., 2011; Gong et al., 2012a; 2013; Li et al., 2011; Raginsky & Lazebnik, 2009).
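For illustration, the MSQ embedding (2) and its Hamming-distance estimator can be sketched in a few lines of NumPy (toy dimensions; this is the baseline we contrast with, not our construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 128, 4096  # ambient dimension and number of bits (illustrative sizes)
G = rng.standard_normal((m, n))  # standard Gaussian projection matrix

def embed(x):
    """1-bit MSQ embedding q_x = sign(Gx)."""
    return np.sign(G @ x)

def hamming(qx, qy):
    """Normalized Hamming distance (1/(2m)) ||q_x - q_y||_1."""
    return np.abs(qx - qy).sum() / (2 * m)

def angular(x, y):
    """Normalized angular distance (1/pi) arccos(<x,y> / (||x|| ||y||))."""
    c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

x, y = rng.standard_normal(n), rng.standard_normal(n)
# hamming(embed(x), embed(y)) concentrates around angular(x, y),
# while all magnitude information is lost: embed(x) equals embed(2 * x)
```

Note that the sketch also exhibits the scale invariance discussed below: q_x is unchanged when x is multiplied by any positive scalar.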
Indeed, fast JL transforms (FJLT) and Gaussian Toeplitz matrices (Yi et al., 2015), structured hashed projections (Choromanska et al., 2016), iterative quantization (Gong et al., 2012b), bilinear projection (Gong et al., 2013), circulant binary embedding (Yu et al., 2014; Dirksen & Stollenwerk, 2018; 2017; Oymak et al., 2017; Kim et al., 2018), sparse projection (Xia et al., 2015), and fast orthogonal projection (Zhang et al., 2015) have all been considered. These methods can decrease time complexity to O(n log n) operations per embedding, but they still suffer from some important drawbacks. Notably, due to the sign function, these algorithms completely discard all magnitude information, since sign(Ax) = sign(A(αx)) for all α > 0. So all points in the same direction embed to the same binary vector and cannot be distinguished. Even if one settles for recovering geodesic distances, using the sign function in (2) is an instance of MSQ, so the estimation error α in (1) decays slowly as the number of bits m increases (Yi et al., 2015). In addition to the above data-independent approaches, there are data-dependent embedding methods for distance recovery, including product quantization (Jegou et al., 2010; Ge et al., 2013), LSH-based methods (Andoni & Indyk, 2006; Shrivastava & Li, 2014; Datar et al., 2004) and iterative quantization (Gong et al., 2012c). Their accuracy, which can be excellent, nevertheless depends on the underlying distribution of the input dataset. Moreover, they may be associated with larger time and space complexity for embedding the data. For example, product quantization performs k-means clustering in each subspace to find potential centroids and stores the associated lookup tables, and LSH-based methods need random shifts and dense random projections to quantize each input data point.
Recently, Huynh & Saab (2020) resolved these issues by replacing the simple sign function with a Sigma-Delta (Σ∆) quantization scheme, or alternatively other noise-shaping schemes (see (Chou & Güntürk, 2016)), whose properties will be discussed in Section 3. They use the binary embedding

q_x := Q(DBx) (3)

where Q is now a stable Σ∆ quantization scheme, D ∈ R^{m×m} is a diagonal matrix with random signs, and B ∈ R^{m×n} is one of certain structured random matrices. To give an example of Σ∆ quantization in this context, consider w := DBx. Then the simplest Σ∆ scheme computes q_x via the following iteration, run for i = 1, ..., m:

u_0 = 0,
q_x(i) = sign(w_i + u_{i-1}), (4)
u_i = u_{i-1} + w_i - q_x(i).

The choices of B in (Huynh & Saab, 2020) allow the matrix-vector multiplication to be implemented using the fast Fourier transform. Then the original Euclidean distance ‖x - y‖2 can be recovered via a pseudo-metric on the quantized vectors given by d_V(q_x, q_y) := ‖V(q_x - q_y)‖2, where V ∈ R^{p×m} is a "normalized condensation operator", a sparse matrix that can be applied fast (see Section 3). Regarding the complexity of applying (3) to a single x ∈ R^n, note that x ↦ DBx has time complexity O(n log n), while the quantization map needs O(m) time and results in an m-bit representation. So when m ≤ n, the total time complexity for (3) is O(n log n).
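The first-order greedy iteration (4) is simple to implement. The sketch below (with a hypothetical input w assumed to satisfy ‖w‖∞ ≤ µ < 1) also illustrates the stability property discussed in Section 3: the state u stays bounded independently of m:

```python
import numpy as np

def sigma_delta_1(w):
    """First-order greedy Sigma-Delta scheme: u_0 = 0,
    q_i = sign(w_i + u_{i-1}), u_i = u_{i-1} + w_i - q_i.
    Returns the bits q and the state sequence u."""
    m = len(w)
    q, u = np.empty(m), np.empty(m)
    state = 0.0
    for i in range(m):
        q[i] = 1.0 if w[i] + state >= 0 else -1.0
        state = state + w[i] - q[i]
        u[i] = state
    return q, u

rng = np.random.default_rng(0)
w = rng.uniform(-0.9, 0.9, 10_000)  # ||w||_inf <= mu = 0.9 < 1
q, u = sigma_delta_1(w)
# Stability: |u_i| <= 1 for every i, no matter how long the input is
```

One can check by induction that |u_{i-1}| ≤ 1 and |w_i| < 1 imply |u_i| ≤ 1, which is the stability bound invoked later.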

1.2. METHODS AND CONTRIBUTIONS

We extend these results by replacing DB in (3) by a sparse Gaussian matrix A ∈ R^{m×n}, so that now

q_x := Q(Ax). (6)

Given scaled high-dimensional data T ⊂ R^n contained in the ℓ2 ball B_2^n(κ) of radius κ, we put forward Algorithm 1 to generate binary sequences and Algorithm 2 to compute estimates of the Euclidean distances between elements of T via an ℓ1-norm rather than an ℓ2-norm. The contribution of this work is threefold. First, we prove Theorem 1.1 quantifying the performance of our algorithms.

Algorithm 1: Fast Binary Embedding for Finite T
  Input: T = {x^(j)}_{j=1}^k ⊆ B_2^n(κ)   // data points in the ℓ2 ball
  Generate A ∈ R^{m×n} as in Definition 2.2   // sparse Gaussian matrix
  for j ← 1 to k do
    z^(j) ← A x^(j)
    q^(j) ← Q(z^(j))   // stable Σ∆ quantizer Q as in (4), or more generally (21)
  Output: binary sequences B = {q^(j)}_{j=1}^k ⊆ {-1, 1}^m

Algorithm 2: ℓ2 Norm Distance Recovery
  Input: q^(i), q^(j) ∈ B   // binary sequences produced by Algorithm 1
  y^(i) ← V q^(i)   // condense the components of q
  y^(j) ← V q^(j)
  Output: ‖y^(i) - y^(j)‖1   // approximation of ‖x^(i) - x^(j)‖2

Theorem 1.1 (Main result). Let T ⊆ R^n be a finite, appropriately scaled set with elements satisfying ‖x‖∞ = O(n^{-1/2} ‖x‖2) and ‖x‖2 ≤ κ < 1. If m ≥ p := Ω(ε^{-2} log(|T|^2/δ)) and r ≥ 1 is the integer order of Q, then with probability 1 - 2δ on the draw of the sparse Gaussian matrix A, the following holds uniformly over all x, y in T: embedding x, y into {-1, 1}^m using Algorithm 1, and estimating the associated distance between them using Algorithm 2, yields the error bound

|d_V(q_x, q_y) - ‖x - y‖2| ≤ c (m/p)^{-r+1/2} + ε ‖x - y‖2

where c > 0 is a constant.

Theorem 1.1 yields an approximation error bounded by two components, one due to quantization and another that resembles the error from a linear JL embedding into a p-dimensional space. The latter part is essentially proportional to p^{-1/2}, while the quantization component decays polynomially fast in m and can be made harmless by increasing m.
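The following is a minimal NumPy sketch of Algorithms 1 and 2 for the simplest first-order (r = 1) scheme. The dimensions, sparsity level, and scaling below are illustrative choices, not the tuned parameters of Theorem 4.2, so the estimate is only a rough approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 1024, 64, 128         # ambient dim, condensed dim, lambda = m/p
m, s = p * lam, 0.05              # embedding dim and sparsity (illustrative)

# Sparse Gaussian matrix (Definition 2.2): N(0, 1/s) w.p. s, else 0
A = (rng.random((m, n)) < s) * rng.standard_normal((m, n)) / np.sqrt(s)

def quantize(z):
    """Algorithm 1 inner loop: stable first-order Sigma-Delta bits."""
    q, u = np.empty(m), 0.0
    for i in range(m):
        q[i] = 1.0 if z[i] + u >= 0 else -1.0
        u += z[i] - q[i]
    return q

# Condensation operator for r = 1: v is the all-ones row, V = I_p (kron) v,
# normalized as in Definition 2.3
v = np.ones(lam)
V = np.sqrt(np.pi / 2) / (p * np.linalg.norm(v)) * np.kron(np.eye(p), v)

def embed(x):                     # Algorithm 1
    return quantize(A @ x)

def dist(qx, qy):                 # Algorithm 2: ell_1 norm of condensed diff
    return np.abs(V @ (qx - qy)).sum()

# Two well-spread points in a small ell_2 ball (kappa = 0.15 < 1)
x = rng.choice([-1.0, 1.0], n) * 0.15 / np.sqrt(n)
y = rng.choice([-1.0, 1.0], n) * 0.15 / np.sqrt(n)
est, true = dist(embed(x), embed(y)), np.linalg.norm(x - y)
# est roughly approximates true; accuracy improves with p and lam
```

Note that the dense matrices A and V appear here only for readability; in practice both are stored sparsely, which is the source of the O(m) complexity claims below.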
Moreover, the number of bits m ≳ ε^{-2} log(|T|) achieves the optimal bit complexity required by any oblivious random embedding that preserves Euclidean or squared Euclidean distances; see Theorem 4.1 in (Dirksen & Stollenwerk, 2020). Theorem 4.2 is a more precise version of Theorem 1.1, with all quantifiers and scaling parameters specified explicitly, and with a potential modification to A that enables the result to hold for arbitrary (not necessarily well-spread) finite T, at the cost of increasing the computational complexity of embedding a point to O(n log n). We also note that if the data did not satisfy the scaling assumption of Theorems 1.1 and 4.2, then one could replace {-1, 1} by {-C, C}, and the quantization error would scale by C. Second, due to the sparsity of A, (6) can be computed much faster than (3) when we restrict our results to "well-spread" vectors x, i.e., those that are not sparse. Indeed, in Section 5, we show that Algorithm 1 achieves O(m) time and space complexity, in contrast with the common O(n log n) runtime of fast binary embeddings, e.g., (Gong et al., 2013; Yi et al., 2015; Yu et al., 2014; Dirksen & Stollenwerk, 2018; 2017; Huynh & Saab, 2020), that rely on fast JL transforms or circulant matrices. Meanwhile, Algorithm 2 requires only O(m) runtime. Third, Definition 2.3 shows that V is sparse and essentially populated by integers bounded by (m/p)^r, where r, m, p are as in Theorem 1.1. In Section 5, we note that each y^(i) = V q^(i) (and the distance query) can be represented by O(p log2(m/p)) bits, instead of m bits, without affecting the reconstruction accuracy. This is a consequence of using the ℓ1-norm in Algorithm 2. Had we instead used an ℓ2-norm, we would have required O(p (log2(m/p))^2) bits. Finally, we remark that while the assumption that the vectors x are well-spread (i.e., ‖x‖∞ = O(n^{-1/2} ‖x‖2)) may appear restrictive, there are important instances where it holds.
Natural images seem to be one such case, as are random Fourier features (Rahimi & Recht, 2007). Similarly, Gaussian (and other subgaussian) random vectors satisfy the slightly weakened assumption ‖x‖∞ = O(log(n) n^{-1/2} ‖x‖2) with high probability, and one can modify our construction by slightly reducing the sparsity of A (and slightly increasing the computational cost) to handle such vectors. On the other hand, if the data simply does not satisfy such an assumption, one can still apply Theorem 4.2 part (ii), but now the complexity of embedding a point is O(n log n).

2. PRELIMINARIES

2.1. NOTATION AND DEFINITIONS

Throughout, f(n) = O(g(n)) and f(n) = Ω(g(n)) mean that |f(n)| is bounded above and below, respectively, by a positive function g(n) up to constants asymptotically; for instance, f(n) = O(g(n)) means lim sup_{n→∞} |f(n)|/g(n) < ∞. Similarly, we use f(n) = Θ(g(n)) to denote that f(n) is bounded both above and below by a positive function g(n) up to constants asymptotically. We next define operator norms.

Definition 2.1. Let α, β ∈ [1, ∞]. The (α, β) operator norm of K ∈ R^{m×n} is ‖K‖_{α,β} = max_{x ≠ 0} ‖Kx‖_β / ‖x‖_α.

We now introduce some notation and definitions that are relevant to our construction.

Definition 2.2 (Sparse Gaussian random matrix). Let A = (a_ij) ∈ R^{m×n} be a random matrix with i.i.d. entries such that a_ij is 0 with probability 1 - s and is drawn from N(0, 1/s) with probability s.

We adopt the definition of a condensation operator of Chou & Güntürk (2016); Huynh & Saab (2020).

Definition 2.3 (Condensation operator). Let p, r, λ be fixed positive integers such that λ = rλ̃ - r + 1 for some integer λ̃. Let m = λp and let v be the row vector in R^λ whose entry v_j is the j-th coefficient of the polynomial (1 + z + ... + z^{λ̃-1})^r. Define the condensation operator Ṽ ∈ R^{p×m} as the block-diagonal matrix Ṽ = I_p ⊗ v, i.e., with p copies of v along the diagonal. For example, when r = 1, λ̃ = λ and v ∈ R^λ is simply the vector of all ones. The normalized condensation operator is given by V = (√(π/2) / (p ‖v‖2)) Ṽ.

The fast JL transform was first studied by Ailon & Chazelle (2009). It admits many variants and improvements, e.g., (Krahmer & Ward, 2011; Matoušek, 2008). The idea is that given any x ∈ R^n, we use a fast "Fourier-like" transform, like the Walsh-Hadamard transform, to distribute the total mass (i.e., ‖x‖2) of x relatively evenly over its coordinates.

Definition 2.4 (FJLT). The fast JL transform is given by Φ := AHD ∈ R^{m×n}.
Here, A ∈ R^{m×n} is a sparse Gaussian random matrix as in Definition 2.2, while H ∈ R^{n×n} is a normalized Walsh-Hadamard matrix defined by H_ij = n^{-1/2} (-1)^{⟨i-1, j-1⟩}, where ⟨i, j⟩ is the bitwise dot product of the binary representations of the numbers i and j. Finally, D ∈ R^{n×n} is diagonal with diagonal entries drawn independently and uniformly from {-1, 1}.
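To see why HD helps, the following sketch (a toy implementation of the normalized fast Walsh-Hadamard transform; n is assumed to be a power of two) shows that even a maximally sparse input, a standard basis vector, becomes perfectly well-spread:

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform Hx in O(n log n) time."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)                        # n must be a power of two
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):  # butterfly pass
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

rng = np.random.default_rng(2)
n = 4096
D = rng.choice([-1.0, 1.0], n)        # random sign flips
e0 = np.zeros(n); e0[0] = 1.0         # worst-case (maximally sparse) input
z = fwht(D * e0)
# z has ||z||_2 = 1 and every entry equal to +-n^{-1/2}: perfectly spread
```

Since the normalized H is symmetric and orthogonal, fwht is its own inverse, which gives a quick sanity check of the implementation.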

2.2. CONDENSED JOHNSON-LINDENSTRAUSS TRANSFORMS

Definition 2.5. When V is a condensation operator and A is a sparse Gaussian matrix, we refer to V A as a condensed sparse JL transform (CSJLT). When A is replaced by Φ as in Definition 2.4, we refer to V Φ as a condensed fast JL transform (CFJLT).

The definition above is justified by the following lemma (see Appendix B for the proof).

Lemma 2.6 (CJLT lemma). Let T be a finite subset of R^n, λ ∈ N, ε ∈ (0, 1/2), δ ∈ (0, 1), p = O(ε^{-2} log(|T|^2/δ)) ∈ N and m = λp. Let V ∈ R^{p×m} be as in Definition 2.3, let A ∈ R^{m×n} be the sparse Gaussian matrix in Definition 2.2 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2) ≤ 1, and let Φ = AHD ∈ R^{m×n} be the FJLT in Definition 2.4 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2 log n) ≤ 1. If T consists of well-spread vectors, that is, ‖x‖∞ = O(n^{-1/2} ‖x‖2) for all x ∈ T, then

|‖V A(x - y)‖1 - ‖x - y‖2| ≤ ε ‖x - y‖2 (8)

holds uniformly for all x, y ∈ T with probability at least 1 - δ. If T is finite but arbitrary, then

|‖V Φ(x - y)‖1 - ‖x - y‖2| ≤ ε ‖x - y‖2 (9)

holds uniformly for all x, y ∈ T with probability at least 1 - δ.

So T ⊆ R^n is embedded into R^p with pairwise distances distorted by at most ε, where p = O(ε^{-2} log |T|) as one would expect from a JL embedding. This will be needed to guarantee the accuracy associated with our embedding algorithms. Note that the bound on p does not require extra logarithmic factors, in contrast to the bound O(ε^{-2} log |T| log^4 n) in (Huynh & Saab, 2020).

3. SIGMA-DELTA QUANTIZATION

An r-th order Σ∆ quantizer Q^(r) : R^m → A^m maps an input signal y = (y_i)_{i=1}^m ∈ R^m to a quantized sequence q = (q_i)_{i=1}^m ∈ A^m via a quantization rule ρ and the following iterations

u_0 = u_{-1} = ... = u_{1-r} = 0,
q_i = Q(ρ(y_i, u_{i-1}, ..., u_{i-r})) for i = 1, 2, ..., m, (10)
P^r u = y - q

where Q(y) = arg min_{v ∈ A} |y - v| is the scalar quantizer associated with the alphabet A, and P ∈ R^{m×m} is the first-order difference matrix defined by P_ij = 1 if i = j, P_ij = -1 if i = j + 1, and P_ij = 0 otherwise.
Note that (10) is amenable to an iterative update of the state variables u_i, since

P^r u = y - q ⟺ u_i = Σ_{j=1}^r (-1)^{j-1} (r choose j) u_{i-j} + y_i - q_i, i = 1, 2, ..., m. (11)

Definition 3.1. A quantization scheme is stable if there exist constants µ, C > 0 such that for each input y with ‖y‖∞ ≤ µ, the state vector u ∈ R^m satisfies ‖u‖∞ ≤ C. Crucially, µ and C do not depend on m.

Stability heavily depends on the choice of quantization rule and is difficult to guarantee for arbitrary ρ in (10) when the alphabet is small, as in the case of 1-bit quantization where A = {±1}. When r = 1 and A = {±1}, the simplest stable Σ∆ scheme Q^(1) : R^m → A^m is equipped with the greedy quantization rule ρ(y_i, u_{i-1}) := u_{i-1} + y_i, giving the simple iteration (4) from the introduction, albeit with y_i replacing w_i. A description of the design and properties of stable Q^(r) with r ≥ 2 can be found in Appendix C.
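The equivalence in (11) is easy to verify numerically. The sketch below (with illustrative helper names) recovers the state u from an arbitrary bit sequence q via the recurrence and then checks P^r u = y - q by applying the first-order difference r times:

```python
import numpy as np
from math import comb

def state_from_bits(y, q, r):
    """Solve P^r u = y - q for u via the recurrence
    u_i = sum_{j=1}^r (-1)^(j-1) C(r, j) u_{i-j} + y_i - q_i,
    with zero initial conditions u_0 = ... = u_{1-r} = 0."""
    m = len(y)
    u = np.zeros(m + r)               # r leading zeros hold the initial state
    for i in range(m):
        acc = sum((-1) ** (j - 1) * comb(r, j) * u[r + i - j]
                  for j in range(1, r + 1))
        u[r + i] = acc + y[i] - q[i]
    return u[r:]

def apply_Pr(u, r):
    """Apply the r-th power of the difference matrix P: (Pu)_i = u_i - u_{i-1}."""
    w = np.asarray(u, dtype=float)
    for _ in range(r):
        w = w - np.concatenate(([0.0], w[:-1]))
    return w

rng = np.random.default_rng(3)
y = rng.uniform(-0.5, 0.5, 200)
q = np.where(rng.random(200) < 0.5, -1.0, 1.0)   # arbitrary +-1 bits
u = state_from_bits(y, q, r=2)
# apply_Pr(u, 2) reproduces y - q
```

The same check passes for any order r, which is exactly the content of the equivalence (11).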

4. MAIN RESULTS

The ingredients that make our construction work are a JL embedding followed by Σ∆ quantization. Together these embed points into {±1}^m, but it remains to define a pseudometric so that we may approximate Euclidean distances by distances on the cube. We now define this pseudometric.

Definition 4.1. Let A^m = {±1}^m and let V ∈ R^{p×m} with p ≤ m. We define d_V on A^m × A^m as d_V(q_1, q_2) = ‖V(q_1 - q_2)‖1 for all q_1, q_2 ∈ A^m.

We now present our main result, a more technical version of Theorem 1.1, proved in Appendix D.

Theorem 4.2 (Main result). Let λ, r ∈ N, ε ∈ (0, 1/2), δ ∈ (0, 1), β = Ω(log(|T|/δ)) > 0, µ ∈ (0, 1), p = Ω(ε^{-2} log(|T|^2/δ)) ∈ N, and m = λp. Let V ∈ R^{p×m} be as in Definition 2.3, let A ∈ R^{m×n} be the sparse Gaussian matrix in Definition 2.2 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2) ≤ 1, and let Φ be the FJLT in Definition 2.4 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2 log n) ≤ 1. Let T be a finite subset of B_2^n(κ) := {x ∈ R^n : ‖x‖2 ≤ κ} and suppose that κ ≤ µ / (2√(β + log(2m))). Defining the embedding maps f_1 : T → {±1}^m by f_1 = Q^(r) ∘ A and f_2 : T → {±1}^m by f_2 = Q^(r) ∘ Φ, there exists a constant C(µ, r) such that the following are true:

(i) If the elements of T satisfy ‖x‖∞ = O(n^{-1/2} ‖x‖2), then the bound

|d_V(f_1(x), f_1(y)) - ‖x - y‖2| ≤ C(µ, r) λ^{-r+1/2} + ε ‖x - y‖2 (12)

holds uniformly for all x, y ∈ T with probability exceeding 1 - δ - |T| e^{-β}.

(ii) On the other hand, for arbitrary T ⊂ B_2^n(κ),

|d_V(f_2(x), f_2(y)) - ‖x - y‖2| ≤ C(µ, r) λ^{-r+1/2} + ε ‖x - y‖2 (13)

holds uniformly for all x, y ∈ T with probability exceeding 1 - δ - 2|T| e^{-β}.

Under the assumptions of Theorem 4.2, we have

ε = O(√(log(|T|^2/δ) / p)) ≍ 1/√p. (14)

By (12), (13) and (14), we have that with high probability the inequality

|d_V(f_i(x), f_i(y)) - ‖x - y‖2| ≤ C(µ, r) (m/p)^{-r+1/2} + ε ‖x - y‖2
  ≤ C(µ, r) (m/p)^{-r+1/2} + 2εκ
  ≤ C(µ, r) (m/p)^{-r+1/2} + (µ / √(β + log(2m))) · (C_2 / √p) (15)

holds uniformly for x, y ∈ T. The first error term in (15) results from Σ∆ quantization, while the second error term is caused by the CJLT.
So the quantization term O((m/p)^{-r+1/2}) dominates when λ = m/p is small, whereas if m/p is sufficiently large, the second term O(1/√p) becomes dominant.

5. COMPUTATIONAL AND SPACE COMPLEXITY

In this section, we assume that T = {x^(j)}_{j=1}^k ⊆ R^n consists of well-spread vectors. Moreover, we focus on stable r-th order Σ∆ schemes Q^(r) : R^m → A^m with A = {-1, 1}. By Definition 2.3, when r = 1 we have v = (1, 1, ..., 1) ∈ R^λ, while when r = 2, v = (1, 2, ..., λ̃ - 1, λ̃, λ̃ - 1, ..., 2, 1) ∈ R^λ. In general, ‖v‖∞/‖v‖2 = O(λ^{-1/2}) holds for all r ∈ N. We also assume that s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2) = Θ(ε^{-1} n^{-1} λ^{-1}) ≤ 1, as in Theorem 4.2. We consider b-bit floating-point or fixed-point representations of numbers. Both entail the same computational complexity for computing sums and products of two numbers: addition and subtraction require O(b) operations, while multiplication and division require M(b) = O(b^2) operations via "standard" long multiplication and division. Multiplication and division can be done more efficiently, particularly for large integers; the best known methods (which are optimal up to constants) have complexity M(b) = O(b log b) (Harvey & Van Der Hoeven, 2019). We also assume random access to the coordinates of our data points.

Embedding complexity. For each data point x^(j) ∈ T, one can use Algorithm 1 to quantize it. Since A has sparsity constant s = Θ(ε^{-1} n^{-1} λ^{-1}) and ε^{-1} = O(p^{1/2}) by (14), and since λ = m/p, computing Ax^(j) needs O(snm) = O(λ^{-1} ε^{-1} m) = O(p^{3/2}) time. Additionally, it takes O(m) time to quantize Ax^(j) based on (21). When p^{3/2} ≤ m, Algorithm 1 can therefore be executed in O(m) time for each x^(j). Because A has O(snm) = O(m) nonzero entries in expectation, the space complexity is O(m) bits per data point. Note that the big-O notation here hides the dependence of the space complexity on the bit depth b of the fixed- or floating-point representation of the entries of A and x^(j). This has no effect on the storage space needed for each q^(j), which is exactly m bits.

Complexity of distance estimation.
If one does not use embedding methods, storing T directly, i.e., representing the coefficients of each x^(j) by b bits, requires knb bits. Moreover, the resulting computational complexity of estimating ‖x - y‖2^2 for x, y ∈ T is O(n M(b)). On the other hand, suppose we obtain binary sequences B = {q^(j)}_{j=1}^k ⊆ A^m by applying Algorithm 1 to T. Using our method, with accuracy guaranteed by Theorem 4.2, the high-dimensional data points T ⊆ R^n are now transformed into short binary sequences, which require only km bits of storage instead of knb bits. Algorithm 2 can then be applied to recover the pairwise ℓ2 distances. Note that V is the normalization of the integer-valued matrix Ṽ = I_p ⊗ v (by Definition 2.3) and q^(i) ∈ A^m is a binary vector. So, by storing the normalization factor separately, we can ignore it when considering runtime and space complexity. Thus we observe:

1. The number of bits needed to represent each entry of v is at most log2(‖v‖∞) ≈ (r - 1) log2 λ = O(log2 λ) when r > 1, and O(1) when r = 1. So the computation of y^(i) = Ṽ q^(i) ∈ R^p only involves m additions or subtractions of integers represented by O(log2 λ) bits, and thus the time complexity of computing y^(i) is O(m log2 λ).

2. Each of the p entries of y^(i) is the sum of λ terms, each bounded by λ^{r-1} in magnitude. So we can store y^(i) in O(p log2 λ) bits.

3. Computing ‖y^(i) - y^(j)‖1 involves p subtractions and p - 1 additions of integers represented by O(log2 λ) bits, so the query time is O(p log2 λ).

Table 1: "Time" is the time needed to embed a data point, while "Space" is the space needed to store the embedding matrix. "Storage" is the memory used to store each encoded sequence. "Query time" is the time complexity of pairwise distance estimation.

Method | Time | Space | Storage | Query time
FJLT-based (Yi et al., 2015) | O(n log n) | O(n) | O(m) | O(m)
Bilinear (Gong et al., 2013) | O(n √m) | O(√m n) | O(m) | O(m)
Circulant (Yu et al., 2014) | O(n log n) | O(n) | O(m) | O(m)
BOE or PCE (Huynh & Saab, 2020) | O(n log n) | O(n) | O(p log2 λ) | O(p M(log2 λ))
Our Algorithm (on well-spread T) | O(m) | O(m) | O(p log2 λ) | O(p log2 λ)

The first three (sign-based) methods recover geodesic distances, while BOE/PCE and our algorithm recover Euclidean distances.

Comparisons with baselines.

In Table 1, we compare our algorithm with various JL-based methods from Section 1. Here n is the input dimension, m is the embedding dimension (and number of bits), and p = m/λ is the length of the encoded sequences y = V q. In our case, we use O(p log2 λ) bits to store y = V q. See Appendix E for a comparison with product quantization.

Method 1. We quantize FJLT embeddings Φx and recover distances based on Algorithm 2.
Method 2. We quantize sparse JL embeddings Ax and recover distances by Algorithm 2.

In order to test the performance of our algorithm, we compute the mean absolute percentage error (MAPE) of the reconstructed ℓ2 distances, averaged over all k(k-1)/2 distinct pairs of data points, that is,

(2/(k(k-1))) Σ_{x≠y∈T} |‖V(q_x - q_y)‖1 - ‖x - y‖2| / ‖x - y‖2.

Experiments on the Yelp dataset. To give a numerical illustration of the relation among the length m of the binary sequences, embedding dimension p, and order r, as compared to the upper bound in (15), we use both Method 1 and Method 2 on the Yelp dataset. We randomly sample k = 1000 images and scale them by the same constant so that all data points are contained in the ℓ2 unit ball. The scaled dataset is denoted by T. Based on Theorem 4.2, we set n = 16384 and s = 1650/n ≈ 0.1. For each fixed p, we apply Algorithm 1 and Algorithm 2 for various m. We present our experimental results for stable Σ∆ quantization schemes, given by (21), with r = 1 and r = 2 in Figure 1. For r = 1, we observe that the curve with small p quickly reaches an error floor, while with high p the error decays like m^{-1/2} and eventually reaches a lower floor. The reason is that the first error term in (15) is dominant when m/p is relatively small, but the second error term eventually dominates as m/p grows. Next, we illustrate the relationship between the quantization order r and the number of measurements m in Figure 2. The curves obtained directly from an unquantized CFJLT (resp.
CSJLT) as in Lemma 2.6, with m = 256, 512, 1024, 2048, 4096 and p = 64, are used for comparison against the quantization methods. The first row of Figure 2 depicts the mean squared relative error when p = 64 is fixed for all distinct methods. It shows that stable quantization schemes with order r > 1 outperform the first-order greedy quantization method, particularly when m is large. Moreover, both the r = 2 and r = 3 curves converge to the CFJLT/CSJLT result as m approaches 4096. Note that by using a quarter of the original dimension, i.e., m = 4096, our construction achieves less than 10% error. Furthermore, if we encode Ṽq as discussed in Section 5, then we need at most rp log2 λ = 64r log2(4096/64) = 384r bits per image, which is 0.023r bits per pixel. For our final experiment, we illustrate that the performance of the proposed approach can be further improved. Note that the choice of p only affects the distance computation in Algorithm 2 and does not appear in the embedding algorithm. In other words, one can vary p in Algorithm 2 to improve performance. This can be done analytically, by viewing the right-hand side of (15) as a function of p and optimizing over p (up to constants), or empirically, as we do here. Following this intuition, if we vary p as a function of m and use the empirically optimal p := p(m) in the construction of V, then we obtain the second row of Figure 2, where the choice r = 3 exhibits lower error than the other quantization methods. Note that the decay rate, as a function of m, very closely resembles that of the unquantized JL embedding, particularly for higher orders r (as one can verify by optimizing the right-hand side of (15)).

A COMPARISONS ON DIFFERENT DATASETS

Experiments on the Yelp dataset in Section 6 showed that Method 2, based on sparse JL embeddings, performs as well as Method 1, which uses an FJLT to enforce the well-spreadness assumption.
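The MAPE metric used throughout these experiments can be computed as below (a small sketch; pairwise_mape and its arguments are illustrative names, with D_est standing for any k × k matrix of estimated pairwise distances):

```python
import numpy as np

def pairwise_mape(X, D_est):
    """Mean absolute percentage error of estimated pairwise distances
    D_est (k x k) against the true ell_2 distances between the rows of X,
    averaged over the k(k-1)/2 distinct pairs."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)     # indices of distinct pairs
    return np.mean(np.abs(D_est[iu] - D[iu]) / D[iu])

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D_true = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
# A uniform 10% overestimate of every distance yields a MAPE of 0.1
```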
Now we focus only on Method 2 and check its performance on four different datasets: Yelp, ImageNet, Flickr30k, and CIFAR-10. Specifically, for each dataset we randomly sample k = 1000 images and scale them such that all scaled data points are contained in the ℓ2 unit ball. Then we apply Method 2 to each dataset separately and compute the corresponding MAPE metric; see Figure 3, where we fix p = 64 and let r = 1, 2. We observe that the curves with r = 1 fluctuate, but display a clear downward trend when m ≤ 8192 and reach an error floor around 0.08. In contrast to the first-order quantization scheme, the curves with r = 2 decay faster and eventually achieve a lower floor around 0.07. Additionally, Method 2 performs well on all datasets, which suggests that the assumption of well-spread input vectors is not too restrictive on natural images.

B PROOF OF LEMMA 2.6

We will require the following lemmas, adapted from the literature, to prove the distance-preserving properties of our condensed sparse Johnson-Lindenstrauss transform (CSJLT) and condensed fast Johnson-Lindenstrauss transform (CFJLT) in Lemma 2.6.

Lemma B.1 (Theorem 5.1 in Matoušek (2008)). Let n ∈ N, ε ∈ (0, 1/2), δ ∈ (0, 1), α ∈ [1/√n, 1] be parameters and set m = Cε^{-2} log(δ^{-1}) ∈ N, where C is a sufficiently large constant. Let s = 2α^2/ε ≤ 1 and let A ∈ R^{m×n} be as in Definition 2.2. Then

P((1 - ε)‖x‖2 ≤ (√(π/2)/m)‖Ax‖1 ≤ (1 + ε)‖x‖2) ≥ 1 - δ (16)

holds for all x ∈ R^n with ‖x‖∞ ≤ α‖x‖2.

Lemma B.2 below is adapted from (Ailon & Chazelle, 2009, Lemma 1), and we present its proof for completeness.

Lemma B.2. Let H ∈ R^{n×n} and D ∈ R^{n×n} be as in Definition 2.4. For any λ > 0 and x ∈ R^n we have

P(‖HDx‖∞ ≤ λ‖x‖2) ≥ 1 - 2ne^{-nλ^2/2}. (17)

Proof. Without loss of generality, we can assume ‖x‖2 = 1. Let u = HDx = (u_1, ..., u_n) and fix i ∈ {1, ..., n}. Then u_i = Σ_{j=1}^n a_j x_j with P(a_j = 1/√n) = P(a_j = -1/√n) = 1/2 for all j. Moreover, a_1, a_2, ..., a_n are independent and symmetric.
So u_i is also symmetric, that is, u_i and -u_i share the same distribution. For any t ∈ R we have

E(e^{tnu_i}) = Π_{j=1}^n E[exp(tna_j x_j)] = Π_{j=1}^n (exp(t√n x_j) + exp(-t√n x_j))/2 ≤ Π_{j=1}^n exp(nt^2 x_j^2/2) = exp(nt^2/2).

Since u_i is symmetric, Markov's inequality and the above bound give

P(|u_i| ≥ λ) = 2P(e^{λnu_i} ≥ e^{λ^2 n}) ≤ 2e^{-λ^2 n} E(e^{λnu_i}) ≤ 2e^{-λ^2 n/2}.

Inequality (17) follows by the union bound over all i ∈ {1, ..., n}.

Lemma B.3. Let n, λ ∈ N, ε ∈ (0, 1/2), δ ∈ (0, 1), p = O(ε^{-2} log(δ^{-1})) ∈ N and m = λp. Let V ∈ R^{p×m} be as in Definition 2.3, let A ∈ R^{m×n} be the sparse Gaussian matrix in Definition 2.2 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2) ≤ 1, and let Φ = AHD ∈ R^{m×n} be the FJLT in Definition 2.4 with s = Θ(ε^{-1} n^{-1} (‖v‖∞/‖v‖2)^2 log n) ≤ 1. Then for x ∈ R^n with ‖x‖∞ = O(n^{-1/2} ‖x‖2), we have

P((1 - ε)‖x‖2 ≤ ‖V Ax‖1 ≤ (1 + ε)‖x‖2) ≥ 1 - δ, (18)

and for arbitrary x ∈ R^n, we have

P((1 - ε)‖x‖2 ≤ ‖V Φx‖1 ≤ (1 + ε)‖x‖2) ≥ 1 - δ. (19)

Proof. Recall that Ṽ = I_p ⊗ v and Φ = AHD. Let y ∈ R^n and K := Ṽ A = (I_p ⊗ v)A ∈ R^{p×n}. For 1 ≤ i ≤ p and 1 ≤ j ≤ n, we have K_ij = Σ_{k=1}^λ v_k a_{(i-1)λ+k, j}. Denote the row vectors of A by a_1, a_2, ..., a_m. It follows that

(Ky)_i = Σ_{j=1}^n K_ij y_j = Σ_{j=1}^n Σ_{k=1}^λ y_j v_k a_{(i-1)λ+k, j} = Σ_{k=1}^λ v_k ⟨y, a_{(i-1)λ+k}⟩ = [B(v^T ⊗ y)]_i,

where B ∈ R^{p×λn} is the matrix whose i-th row is the concatenation (a_{(i-1)λ+1}, a_{(i-1)λ+2}, ..., a_{iλ}), and v^T ⊗ y = (v_1 y; v_2 y; ...; v_λ y) ∈ R^{λn}. Hence Ṽ Ay = Ky = B(v^T ⊗ y) holds for all y ∈ R^n. Additionally, B is itself a sparse Gaussian random matrix, obtained by rearranging the entries of A. For the first assertion, suppose x ∈ R^n satisfies ‖x‖∞ = O(‖x‖2/√n). Then Ṽ Ax = B(v^T ⊗ x), ‖v^T ⊗ x‖2 = ‖v‖2 ‖x‖2 and ‖v^T ⊗ x‖∞ = ‖v‖∞ ‖x‖∞, so (18) follows by applying Lemma B.1 to the random matrix B and the vector v^T ⊗ x with α = Θ(n^{-1/2} ‖v‖∞/‖v‖2). For the second assertion, if x ∈ R^n is arbitrary, then substituting HDx for y gives Ṽ Φx = B(v^T ⊗ (HDx)). Note that ‖v^T ⊗ (HDx)‖2 = ‖v‖2 ‖HDx‖2 = ‖v‖2 ‖x‖2 and ‖v^T ⊗ (HDx)‖∞ = ‖v‖∞ ‖HDx‖∞. Inequality (19) then follows by applying Lemma B.1 and Lemma B.2 to the random matrix B and the vector v^T ⊗ (HDx) with α = Θ((n^{-1} log n)^{1/2} ‖v‖∞/‖v‖2).

Now we can embed a set of points in a high-dimensional space into a space of much lower dimension in such a way that distances between the points are nearly preserved. Substituting 2δ/|T|^2 for δ in Lemma B.3 and using the fact that

1 - (|T| choose 2) · (2δ/|T|^2) = 1 - (|T|(|T|-1)/2) · (2δ/|T|^2) > 1 - δ,

Lemma 2.6 follows from the union bound over all pairs of data points in T.

C STABLE SIGMA-DELTA QUANTIZATION AND ITS PROPERTIES

Although it is a non-trivial task to design a stable quantization rule $\rho$ when $r > 1$, families of one-bit $\Sigma\Delta$ quantization schemes that achieve this goal have been designed by Daubechies & DeVore (2003); Güntürk (2003); Deift et al. (2011), and we now describe one such family. To start, note that an $r$-th order $\Sigma\Delta$ quantization scheme may also arise from a more general difference equation of the form
$$y - q = f * v \qquad (20)$$
where $*$ denotes convolution and the sequence $f = P^r g$ with $g \in \ell^1$. Then any (bounded) solution $v$ of (20) generates a (bounded) solution $u$ of (11) via $u = g * v$. Thus (11) can be rewritten in the form (20) by a change of variables. Define $h := \delta^{(0)} - f$, where $\delta^{(0)}$ denotes the Kronecker delta sequence supported at $0$, and choose the quantization rule $\rho$ in terms of the new variable as $\rho(v)_i = (h*v)_i + y_i$. Then (10) reads as
$$q_i = Q\big((h*v)_i + y_i\big), \qquad v_i = (h*v)_i + y_i - q_i. \qquad (21)$$
By designing a proper filter $h$ one can obtain a stable $r$-th order $\Sigma\Delta$ quantizer, as was done in Deift et al. (2011); Güntürk (2003), leading to the following result from Güntürk (2003), which exploits the above relationship between $v$ and $u$ to bound $\|u\|_\infty$.

Proposition C.1. Fix an integer $r$ and an integer $\sigma \ge 6$, and let $n_j = \sigma(j-1)^2 + 1$ for $j = 1, 2, \dots, r$. Let the filter $h$ be of the form $h = \sum_{j=1}^{r} d_j \delta^{(n_j)}$, where $\delta^{(n_j)}$ is the Kronecker delta supported at $n_j$ and
$$d_j = \prod_{i \ne j} \frac{n_i}{n_i - n_j}, \qquad j = 1, 2, \dots, r.$$
There exists a universal constant $C > 0$ such that the $r$-th order $\Sigma\Delta$ scheme (21) with the $1$-bit alphabet $\mathcal{A} = \{-1, 1\}$ is stable, and
$$\|y\|_\infty \le \mu < 1 \implies \|u\|_\infty \le C c(\mu)^r r^r, \qquad (22)$$
where $c(\mu) > 0$ is a constant that depends only on $\mu$.

Completing the proof of Lemma C.3: by a union bound, (23), and (27),
$$P\big(\|Ay\|_\infty \ge \mu\big) = P\Big(\max_{1\le i\le m}|Y_i| \ge \mu\Big) \le m\, P\big(|Y_i| \ge \mu\big) \le 2m e^{-\mu^2/4} \le e^{-\beta}. \qquad (28)$$
It follows immediately from (26) and (28) with $y = HDx$ that
$$P(\|\Phi x\|_\infty \le \mu) = P(\|AHDx\|_\infty \le \mu) \ge P(\|AHDx\|_\infty \le \mu,\ \|HDx\|_\infty \le \lambda)$$
$$= P(\|AHDx\|_\infty \le \mu \mid \|HDx\|_\infty \le \lambda)\, P(\|HDx\|_\infty \le \lambda) \ge (1 - e^{-\beta})^2 \ge 1 - 2e^{-\beta}.$$
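To make scheme (21) and the filter of Proposition C.1 concrete, here is a minimal sketch of the one-bit $\Sigma\Delta$ quantizer with $h = \sum_j d_j \delta^{(n_j)}$. The test signal and the parameter choice $r = 2$, $\sigma = 6$ are illustrative assumptions; the stability bound asserted in the comments is a short induction for this particular filter, not the general constant of (22).

```python
import numpy as np

def sd_filter(r, sigma=6):
    """Taps n_j = sigma*(j-1)^2 + 1 and weights d_j = prod_{i != j} n_i/(n_i - n_j)."""
    taps = np.array([sigma * (j - 1) ** 2 + 1 for j in range(1, r + 1)])
    d = np.array([np.prod([taps[i] / (taps[i] - taps[j]) for i in range(r) if i != j])
                  for j in range(r)])
    return taps, d

def sigma_delta(y, r=2, sigma=6):
    """One-bit scheme (21): q_i = sign((h*v)_i + y_i), v_i = (h*v)_i + y_i - q_i."""
    taps, d = sd_filter(r, sigma)
    off = taps.max()                      # padding so v_{i - n_j} is defined (zero past)
    v = np.zeros(len(y) + off)
    q = np.empty(len(y))
    for i in range(len(y)):
        hv = sum(d[j] * v[off + i - taps[j]] for j in range(r))   # (h * v)_i
        w = hv + y[i]
        q[i] = 1.0 if w >= 0 else -1.0    # 1-bit alphabet {-1, 1}
        v[off + i] = w - q[i]
    return q, v[off:]

# For r = 2, sigma = 6: h = (7/6) delta^(1) - (1/6) delta^(7). If ||y||_inf <= 1/2,
# an induction gives ||v||_inf <= 3/2, illustrating stability of the state.
y = 0.5 * np.sin(np.linspace(0.0, 20.0, 500))
q, v = sigma_delta(y)
assert np.all(np.abs(q) == 1.0)
assert np.max(np.abs(v)) <= 1.5 + 1e-12
```

With $r = 1$ the same code reduces to the classical first-order scheme $q_i = \mathrm{sign}(v_{i-1} + y_i)$, $v_i = v_{i-1} + y_i - q_i$, since then $h = \delta^{(1)}$.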
Furthermore, if we replace $y$ by $x$ in (28) and use $A$ with $s = \Theta(\epsilon^{-1} n^{-1})$, then inequality (24) follows. The difference in the choice of $s$ is due to the fact that vectors in the unit ball with $\|x\|_\infty = O(n^{-1/2}\|x\|_2)$ satisfy $\|x\|_\infty \lesssim n^{-1/2}$.

D PROOF OF THEOREM 4.2

Proof. Since the proofs of (12) and (13) are almost identical except for using different random projections $A$ and $\Phi$, we only establish (13) in detail. For any $x \in T \subseteq B_2^n(\kappa)$ we have $\|x\|_2 \le \kappa$. Applying Lemma C.3, we get
$$P(\|\Phi x\|_\infty < \mu) \ge P(\|\Phi x\|_\infty < \mu \|x\|_2/\kappa) \ge P\big(\|\Phi x\|_\infty < 2\sqrt{\beta + \log(2m)}\,\|x\|_2\big) \ge 1 - 2e^{-\beta}.$$
Since the above inequality holds for arbitrary $x \in T$, the union bound gives
$$P\Big(\max_{x\in T}\|\Phi x\|_\infty < \mu\Big) \ge 1 - 2|T| e^{-\beta}.$$
Suppose that $u_x$ is the state vector produced by the stable $r$-th order $\Sigma\Delta$ scheme with input $\Phi x$. Using Lemma C.2 and formula (22), we get
$$\|V P^r\|_{\infty,1}\, \|u_x\|_\infty \le C c(\mu)^r r^r (8r)^{r+1} \sqrt{\pi/2}\; \lambda^{-r+1/2}, \qquad (29)$$
which holds uniformly for all $x \in T$ with probability exceeding $1 - 2|T|e^{-\beta}$. Furthermore, by Lemma 2.6, the probability that
$$\big|\, \|V\Phi(x-y)\|_1 - \|x-y\|_2 \,\big| \le \epsilon \|x-y\|_2 \qquad (30)$$
holds simultaneously for all $x, y \in T$ is at least $1 - \delta$. We deduce from the triangle inequality and (29), (30) that
$$\big| d_V(f_2(x), f_2(y)) - \|x-y\|_2 \big| = \big|\, \|V Q^{(r)}(\Phi x) - V Q^{(r)}(\Phi y)\|_1 - \|x-y\|_2 \,\big|$$
$$\le \big|\, \|V Q^{(r)}(\Phi x) - V Q^{(r)}(\Phi y)\|_1 - \|V\Phi(x-y)\|_1 \,\big| + \big|\, \|V\Phi(x-y)\|_1 - \|x-y\|_2 \,\big|$$
$$\le \|V(Q^{(r)}(\Phi x) - \Phi x) - V(Q^{(r)}(\Phi y) - \Phi y)\|_1 + \big|\, \|V\Phi(x-y)\|_1 - \|x-y\|_2 \,\big|$$
$$\le \|V P^r u_x\|_1 + \|V P^r u_y\|_1 + \big|\, \|V\Phi(x-y)\|_1 - \|x-y\|_2 \,\big|$$
$$\le \|V P^r\|_{\infty,1}\big(\|u_x\|_\infty + \|u_y\|_\infty\big) + \big|\, \|V\Phi(x-y)\|_1 - \|x-y\|_2 \,\big|$$
$$\le 2C c(\mu)^r r^r (8r)^{r+1}\sqrt{\pi/2}\; \lambda^{-r+1/2} + \epsilon\|x-y\|_2$$
$$= \sqrt{2\pi}\, C c(\mu)^r r^r (8r)^{r+1} \lambda^{-r+1/2} + \epsilon\|x-y\|_2 = C(\mu, r)\, \lambda^{-r+1/2} + \epsilon\|x-y\|_2$$
holds uniformly for all $x, y \in T$ with probability at least $1 - \delta - 2|T|e^{-\beta}$. The bound (12) comes with a weaker condition on $\beta$ due to the correspondingly weaker condition in Lemma C.3.

E COMPARISON WITH PRODUCT QUANTIZATION

Note that the distance-preserving quality (as well as the performance on retrieval and classification tasks) of MSQ binary embeddings using bilinear projections (Gong et al., 2013) or circulant matrices (Yu et al., 2014) has been shown to be at least as good as that of product quantization (Jegou et al., 2010), LSH (Andoni & Indyk, 2006; Shrivastava & Li, 2014) and ITQ (Gong et al., 2012c). Our method uses Sigma-Delta quantization, which 1. gives provably better error rates than the MSQ design, as shown in this paper and in (Huynh & Saab, 2020); 2. is more efficient in terms of both memory and distance query computation, as shown in Section 5. In order to compare our algorithm more explicitly with data-dependent methods, we now briefly analyze product quantization as presented in (Jegou et al., 2010). We then present a brief analysis of optimal data-independent methods, as well as data-independent product quantization, in comparison with our method.

DATA-DEPENDENT PRODUCT QUANTIZATION

The key idea here is to decompose the input vector space $\mathbb{R}^n$ into the Cartesian product of $M$ low-dimensional subspaces $\mathbb{R}^d$ with $n = Md$, and to quantize each subspace into $k^*$ codewords, for example by using the k-means algorithm. So the total number of centroids (codewords) in $\mathbb{R}^n$ is $k = (k^*)^M$.

Table 2: Comparison between the proposed method and product quantization per data point.

A direct comparison of the associated errors is not possible, because the error of data-dependent product quantization is a function of the input data distribution and of the convergence of the k-means algorithm. Nevertheless, one can note some tradeoffs from Table 2. Namely, the embedding time and the space needed to store our embedding matrix are lower than those associated with product quantization. On the other hand, the space needed to store the embedded data points and the query time associated with product quantization depend on the parameter choices $M$ and $k^*$, which also affect the resulting accuracy. Finally, we note that product quantization (using k-means clustering) requires a pre-processing time of $O(nNk^*t)$, which is significantly larger than that of our method.
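For concreteness, the following toy sketch implements this decomposition with a hand-rolled k-means; the dataset and the choices of $M$ and $k^*$ are arbitrary illustrative values, not the configuration of Jegou et al. (2010).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's algorithm returning k centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute cell means
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C

def pq_train(X, M, k_star):
    """One k*-codeword codebook per subspace; (k*)^M implicit centroids in R^n."""
    d = X.shape[1] // M
    return [kmeans(X[:, i * d:(i + 1) * d], k_star) for i in range(M)]

def pq_encode(x, books):
    d = len(x) // len(books)
    return [int(np.argmin(((C - x[i * d:(i + 1) * d]) ** 2).sum(-1)))
            for i, C in enumerate(books)]

def pq_decode(code, books):
    return np.concatenate([C[c] for C, c in zip(books, code)])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
books = pq_train(X, M=4, k_star=4)
recon = np.array([pq_decode(pq_encode(x, books), books) for x in X])
assert recon.shape == X.shape
assert ((X - recon) ** 2).mean() < ((X - X.mean(0)) ** 2).mean()
```

Each code has $M$ entries of $\log_2 k^*$ bits, for $M \log_2 k^*$ bits per point (here $4 \cdot 2 = 8$ bits), and pairwise distances are estimated from the decoded centroids or via per-subspace lookup tables.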

DATA-INDEPENDENT PRODUCT QUANTIZATION AND OPTIMALITY OF OUR METHOD

If one were to encode, in a data-independent way, the $\ell_2$ ball of $\mathbb{R}^n$ so that the encoding error is at most $\theta$, then a simple volume argument shows that one needs at least $\theta^{-n}$ codewords, hence $n \log_2(1/\theta)$ bits. This lower bound holds independent of the encoding method, i.e., whether one uses product quantization or any other technique. To reduce the number of bits below $n$, one approach is to capitalize on the finiteness of the data and use a JL-type embedding (such as random sampling for well-spread data) to reduce the dimension to $p \approx \log|T|/\epsilon^2$ (up to log factors), thereby introducing a new embedding error of $\epsilon$ on top of the encoding error. The advantage is that one would then only need to encode an $\ell_2$ ball in the $p$-dimensional space. Again, independently of the encoding method, one would now need $p \log_2(1/\theta)$ bits to achieve an encoding error of $\theta$. If we denote by $c_x, c_y$ the encodings of $x$ and $y$, this gives the error estimate
$$\big|\, \|c_x - c_y\| - \|x-y\| \,\big| \lesssim \theta + \epsilon\|x-y\|.$$
Rewriting the error in terms of the number of bits $b = p \log_2(1/\theta)$, we get
$$\big|\, \|c_x - c_y\| - \|x-y\| \,\big| \lesssim 2^{-b/p} + \epsilon\|x-y\|.$$
Note that in all of this, no computational complexity was taken into account. One can envision replacing the k-means clustering in product quantization with a data-independent encoding. With a careful choice of parameters, this may be significantly more computationally efficient than the above optimal encoding, albeit at the expense of a sub-optimal error bound. On the other hand, our computationally efficient scheme uses $m$ bits, and those $m$ bits can be compressed into $b \approx rp\log(m/p)$ bits (see Section 5). Then our error, by Theorem 4.2, is
$$\big|\, \|c_x - c_y\| - \|x-y\| \,\big| \lesssim (m/p)^{-r+1/2} + \epsilon\|x-y\|,$$
which in rate-distortion terms is
$$\big|\, \|c_x - c_y\| - \|x-y\| \,\big| \lesssim 2^{-\frac{b}{p}\cdot\frac{r-1/2}{r}} + \epsilon\|x-y\|.$$
In other words, up to constants in the exponent and possible logarithmic terms, our result is near-optimal.
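As a sanity check on the exponent bookkeeping, the snippet below verifies numerically that with $b = rp\log_2(m/p)$ bits, the quantization error $(m/p)^{-r+1/2}$ equals $2^{-\frac{b}{p}\cdot\frac{r-1/2}{r}}$; the parameter values are arbitrary illustrative choices.

```python
import math

p, r, m = 64, 2, 4096
b = r * p * math.log2(m / p)              # compressed bit budget b = r p log2(m/p)

lhs = (m / p) ** (-(r - 0.5))             # error exponent from Theorem 4.2
rhs = 2 ** (-(b / p) * (r - 0.5) / r)     # the same error in rate-distortion form

assert b == 768 and math.isclose(lhs, rhs)
print(lhs)  # 2^-9 = 0.001953125
```

With these values the method spends $b = 768$ bits and the quantization term decays like $2^{-9}$, matching both expressions.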



Yelp open dataset: https://www.yelp.com/dataset



Figure 1: Plots of $\ell_2$ distance reconstruction error for $r = 1, 2$.

Figure 2: Plots of $\ell_2$ distance reconstruction error with fixed $p = 64$ and optimal $p = p(m)$.

Figure 3: Plots of MAPE of Method 2 on four datasets with fixed $p = 64$ and order $r = 1, 2$.

and the time complexity of learning all $k$ centroids is $O(nNk^*t)$, where $N$ is the number of training data points and $t$ is the number of iterations of the k-means algorithm. Moreover, converting each input vector $x \in \mathbb{R}^n$ to the index of its codeword takes $O(Mdk^*) = O(nk^*)$ time, and the length of the binary codes is $m = \log_2 k = M\log_2 k^*$. Since we must store all $k$ centroids and $M$ lookup tables, the memory usage is $O(M(dk^* + (k^*)^2)) = O(nk^* + M(k^*)^2)$. Moreover, the query time, i.e., the time complexity of pairwise distance estimation, is $O(Mk^*)$ using lookup tables. As a result, we obtain the comparison summarized in Table 2.

Yan Xia, Kaiming He, Pushmeet Kohli, and Jian Sun. Sparse projections for high-dimensional binary codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3332-3339, 2015.

Xinyang Yi, Constantine Caramanis, and Eric Price. Binary embedding: Fundamental limits and fast algorithm. In International Conference on Machine Learning, pp. 2162-2170, 2015.

Felix Yu, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. Circulant binary embedding. In International Conference on Machine Learning, pp. 946-954, 2014.

Xu Zhang, Felix X. Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang, and Shih-Fu Chang. Fast orthogonal projection based on Kronecker product. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2929-2937, 2015.

The column headings of Table 2 are analogous to those in Table 1.

ACKNOWLEDGMENTS

Our work was supported in part by NSF Grant DMS-2012546 and a UCSD senate research award. The authors would like to thank Sjoerd Dirksen for inspiring discussions and suggestions.


Having introduced stable $\Sigma\Delta$ quantization, we now present a lemma controlling an operator norm of $V P^r$. We will need this result to control the error in approximating distances associated with our binary embedding.

Lemma C.2. For a stable $r$-th order $\Sigma\Delta$ quantization scheme as above,
$$\|V P^r\|_{\infty,1} \le (8r)^{r+1}\sqrt{\pi/2}\; \lambda^{-r+1/2}.$$

Proof. The bound follows by the same method used in the proof of Lemma 4.6 in Huynh & Saab (2020).

The following result guarantees that the linear part of our embedding generates a bounded vector, and therefore allows us to later appeal to the stability property of $\Sigma\Delta$ quantizers. In other words, it will allow us to use (22) to control the infinity norm of the state vectors generated by $\Sigma\Delta$ quantization.

Lemma C.3 (Concentration inequality for $\|\cdot\|_\infty$). Let $\beta > 0$, $\epsilon \in (0,1)$, let $A \in \mathbb{R}^{m\times n}$ be the sparse Gaussian matrix in Definition 2.2 with $s = \Theta(\epsilon^{-1} n^{-1}) \le 1$, and let $\Phi = AHD \in \mathbb{R}^{m\times n}$ be the FJLT in Definition 2.4 with $s = \Theta(\epsilon^{-1} n^{-1}\log n) \le 1$. Suppose that $\mu \ge 2\sqrt{\beta + \log(2m)}$. Then
$$P\big(\|Ax\|_\infty \le \mu\|x\|_2\big) \ge 1 - e^{-\beta} \qquad (24)$$
holds for $x \in \mathbb{R}^n$ with $\|x\|_\infty = O(n^{-1/2}\|x\|_2)$, and
$$P\big(\|\Phi x\|_\infty \le \mu\|x\|_2\big) \ge 1 - 2e^{-\beta} \qquad (25)$$
holds for arbitrary $x \in \mathbb{R}^n$.

Proof. Without loss of generality, we may assume that $x$ is a unit vector with $\|x\|_2 = 1$. We start with the proof of (25). By applying Lemma B.2 to $x$ with $\lambda = \Theta(\sqrt{\log n / n})$, we have
$$P(\|HDx\|_\infty \le \lambda) \ge 1 - e^{-\beta}. \qquad (26)$$
Let $A$ be as in Definition 2.2 with $s = 2\lambda^2/\epsilon = \Theta(\epsilon^{-1} n^{-1}\log n) \le 1$ and recall that $\Phi = AHD$. Suppose that $y \in \mathbb{R}^n$ with $\|y\|_2 = 1$ and $\|y\|_\infty \le \lambda$, and write $Y_i := \langle a_i, y\rangle$. For $t \in [0, t_0]$ with $t_0 = 2/\sqrt{\epsilon}$ we get $t^2 y_j^2/2s \le 1$ for all $j$. Since $e^x \le 1 + 2x$ for all $x \in [0,1]$ and $1 + x \le e^x$ for all $x \in \mathbb{R}$,
$$s e^{t^2 y_j^2/2s} + 1 - s \le s(1 + t^2 y_j^2/s) + 1 - s = 1 + t^2 y_j^2 \le e^{t^2 y_j^2}.$$
It follows that
$$\mathbb{E}(e^{t Y_i}) = \prod_{j=1}^{n} \big(s e^{t^2 y_j^2/2s} + 1 - s\big) \le \prod_{j=1}^{n} e^{t^2 y_j^2} = e^{t^2}$$
holds for all $1 \le i \le m$ and $t \in [0, t_0]$. So for $t \in [0, t_0]$, by Markov's inequality and the above inequality we have
$$P(Y_i \ge \mu) = P(e^{t Y_i} \ge e^{t\mu}) \le e^{-t\mu}\,\mathbb{E}(e^{t Y_i}) \le e^{-t\mu + t^2}.$$
According to (23) we can set $t = \mu/2 \le t_0 = 2/\sqrt{\epsilon}$, which gives
$$P(Y_i \ge \mu) \le e^{-\mu^2/4}. \qquad (27)$$
By symmetry we have $P(-Y_i \ge \mu) \le e^{-\mu^2/4}$. Consequently, $P(|Y_i| \ge \mu) \le 2e^{-\mu^2/4}$ for all $1 \le i \le m$, and the claim follows from the union bound argument in (28).
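A quick Monte Carlo illustration of the resulting $\ell_\infty$ bound: with the union-bound threshold $\mu = 2\sqrt{\beta + \log(2m)}$, the event $\max_i |Y_i| \ge \mu$ should occur with empirical frequency at most $e^{-\beta}$. The sketch below uses dense standard Gaussians as a simplified stand-in for the coordinates $Y_i$ (standard normals obey an even stronger tail than $e^{-\mu^2/4}$); the sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, beta, trials = 1024, 2.0, 300
mu = 2 * np.sqrt(beta + np.log(2 * m))    # union-bound threshold, as in (28)

# Count how often the max absolute coordinate exceeds mu across independent draws.
exceed = sum(np.abs(rng.normal(size=m)).max() >= mu for _ in range(trials))
assert exceed / trials <= np.exp(-beta)   # empirical rate within the e^{-beta} bound
```

In practice the exceedance count here is essentially always zero, since the union bound is far from tight for independent Gaussian coordinates.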

