FAST PARTIAL FOURIER TRANSFORM

Abstract

Given a time-series vector, how can we efficiently compute a specified part of its Fourier coefficients? The fast Fourier transform (FFT) is a widely used algorithm that computes the discrete Fourier transform in many machine learning applications. Despite its pervasive use, FFT algorithms do not provide a fine-tuning option for the user: the output size (the number of Fourier coefficients to be computed) is algorithmically determined by the input size. Since many applications do not require the whole spectrum of the frequency domain, this lack of flexibility often leads to simply discarding the unused coefficients, an inefficiency due to the extra computation. In this paper, we propose the fast Partial Fourier Transform (PFT), an efficient algorithm for computing only a part of the Fourier coefficients. PFT approximates a part of the twiddle factors (trigonometric constants) using polynomials, thereby reducing the computational complexity caused by the mixture of many twiddle factors. We derive the asymptotic time complexity of PFT with respect to input and output sizes, as well as its numerical accuracy. Experimental results show that PFT outperforms the current state-of-the-art algorithms, with an order of magnitude speedup for sufficiently small output sizes, without sacrificing accuracy.

1. INTRODUCTION

How can we efficiently compute a specified part of the Fourier coefficients of a given time-series vector? The discrete Fourier transform (DFT) is a crucial task in several application areas, including anomaly detection (Hou & Zhang, 2007; Rasheed et al., 2009; Ren et al., 2019), data center monitoring (Mueen et al., 2010), and image processing (Shi et al., 2017). Notably, in many such applications, it is well known that the DFT exhibits strong "energy compaction" or "sparsity" in the frequency domain. That is, the Fourier coefficients of the data are mostly small or equal to zero, having a much smaller support compared to the input size. Moreover, the support can often be specified in practice (e.g., a few low-frequency coefficients around the origin). These observations arouse great interest in an efficient algorithm capable of computing only a specified part of the Fourier coefficients. Accordingly, various approaches have been proposed to address the problem, including the Goertzel algorithm (Burrus & Parks, 1985), Subband DFT (Hossen et al., 1995; Shentov et al., 1995), and Pruned FFT (Markel, 1971; Skinner, 1976; Nagai, 1986; Sorensen & Burrus, 1993; Ailon & Liberty, 2009). In this paper, we propose the fast Partial Fourier Transform (PFT), an efficient algorithm for computing a part of the Fourier coefficients. Specifically, we consider the following problem: given a complex-valued vector a of size N, a non-negative integer M, and an integer µ, estimate the Fourier coefficients of a for the interval [µ - M, µ + M]. The resulting algorithm has a remarkably simple structure, composed of several "smaller" FFTs combined with linear pre- and post-processing steps. Consequently, PFT reduces the number of operations to O(N + M log M), which is, to the best of our knowledge, the lowest arithmetic complexity achieved so far.
Moreover, most subroutines of PFT are already highly optimized algorithms (e.g., matrix multiplication and FFT), so the arithmetic gains readily translate into actual run-time improvements. Furthermore, PFT does not require the input size to be a power of 2, unlike many of its competitors. This is because the idea of PFT derives from a modification of the Cooley-Tukey algorithm (Cooley & Tukey, 1965), which also makes it straightforward to extend the idea to higher dimensionality. Through experiments, we show that PFT outperforms the state-of-the-art FFT libraries, FFTW (Frigo & Johnson, 2005) and Intel Math Kernel Library (MKL), as well as Pruned FFTW, with an order of magnitude speedup without sacrificing accuracy.

2. RELATED WORK

We describe various existing methods for computing partial Fourier coefficients.
Fast Fourier transform. One may consider simply using the fast Fourier transform (FFT) and discarding the unnecessary coefficients; FFT efficiently computes the full DFT, reducing the arithmetic cost from the naive O(N^2) to O(N log N). Such an approach has two major advantages: (1) it is straightforward to implement, and (2) it often outperforms the competitors because it directly employs FFT, which has been highly optimized over decades. Therefore, we provide extensive comparisons of PFT and FFT both theoretically and through run-time evaluations. Experimental results in Section 4.2 show that PFT outperforms FFT when the output size is small enough (< 10%) compared to the input size.
Goertzel algorithm. The Goertzel algorithm (Burrus & Parks, 1985) is one of the first methods devised for computing only a part of the Fourier coefficients. The technique is essentially the same as computing the individual coefficients of the DFT, thus requiring O(MN) operations for M coefficients of an input of size N. Specifically, theoretical analysis gives "the M at which the Goertzel algorithm is advantageous over FFT" as M < 2 log N (Sysel & Rajmic, 2012). For example, with N = 2^22, the Goertzel algorithm becomes faster than FFT only when M < 44, while PFT outperforms FFT for M < 2^19 = 524288 (Figure 1b). A few variants that improve the Goertzel algorithm have been proposed (e.g., Boncelet, 1986). Nevertheless, the performance gain is only a small constant factor, so they are still limited to rare scenarios where very few coefficients are required.
Subband DFT. Subband DFT (Hossen et al., 1995; Shentov et al., 1995) consists of two stages: a Hadamard transform that decomposes the input sequence into a set of smaller subsequences, and a correction stage for recombination.
The algorithm approximates a part of the coefficients by eliminating subsequences with small energy contribution, and manages to reduce the number of operations to O(N + M log N). Apart from the arithmetic gain, however, the Subband DFT suffers from a substantial accuracy issue. Indeed, experimental results in Hossen et al. (1995) show that the relative approximation error of the method is around 10^-1 (only one significant figure) regardless of the output size. In contrast, PFT can evaluate the Fourier coefficients to arbitrary numerical precision, which is not the case for the Subband DFT. Such limitations often preclude one from considering the Subband DFT in applications that require a certain degree of accuracy.
Pruned FFT. FFT pruning (Markel, 1971; Skinner, 1976; Nagai, 1986; Sorensen & Burrus, 1993; Ailon & Liberty, 2009) is another technique for the efficient computation of partial Fourier coefficients. The method is a modification of the standard split-radix FFT, where the edges (operations) in a flow graph are pruned away if they do not affect the specified range of the frequency domain. Since it is exact, uses the highly optimized FFT as a subroutine, and reduces the arithmetic cost to O(N log M), the pruned FFT is, along with the full FFT, arguably the most appropriate competitor of PFT. Through experiments (Section 4.2), we show that PFT consistently outperforms the pruned FFT, significantly extending the range of output sizes for which partial Fourier transform becomes practical.
Finally, we mention that there have been other approaches with different settings. For example, Hassanieh et al. (2012a;b) and Indyk et al. (2014) propose the Sparse Fourier transform, which estimates the top-k (the k largest in magnitude) Fourier coefficients of a given vector. The algorithm is useful especially when there is prior knowledge of the number of non-zero coefficients in the frequency domain.
Note that our setting does not require any prior knowledge of the given data.
Applications of FFT. We outline various applications of the fast Fourier transform, to which partial Fourier transform can potentially be applied. FFT has been widely used for anomaly detection (Hou & Zhang, 2007; Rasheed et al., 2009; Ren et al., 2019). Hou & Zhang (2007) and Ren et al. (2019) detect anomalous points of given data by extracting a compact representation with FFT. Rasheed et al. (2009) use FFT to detect local spatial outliers which have similar patterns within a region but different patterns from the outside. Several works (Pagh, 2013; Pham & Pagh, 2013; Malik & Becker, 2018) exploit FFT for efficient operations. Pagh (2013) leverages FFT to efficiently compute a polynomial kernel used with support vector machines (SVMs). Malik & Becker (2018) propose an efficient Tucker decomposition method using FFT. In addition, FFT has been used for fast training of convolutional neural networks (Mathieu et al., 2014; Rippel et al., 2015) and for an efficient recommendation model on a heterogeneous graph (Jin et al., 2020).

3. PROPOSED METHOD

3.1. OVERVIEW

We propose PFT, an efficient algorithm for computing a specified part of Fourier coefficients. The main challenges and our approaches are as follows:
1. How can we extract essential information for a specified output? Considering that only a specified part of the Fourier coefficients should be computed, we need an algorithm requiring fewer operations than the direct use of conventional FFT. This is achievable by carefully modifying the Cooley-Tukey algorithm, finding twiddle factors (trigonometric factors) with small oscillations, and approximating them via polynomials (Section 3.2.1).
2. How can we reduce approximation cost? The approach above involves an approximation process, which would be computationally demanding. We propose using a base exponential function, by which all data-independent factors can be precomputed, enabling one to bypass the approximation problem at run time (Sections 3.2.2 and 3.3).
3. How can we further reduce numerical computation? We carefully reorder operations and factorize terms to alleviate the complexity of PFT. These techniques separate all data-independent factors from data-dependent ones, allowing further precomputation. The arithmetic cost of the resulting algorithm is O(N + M log M), where N and M are input and output size descriptors, respectively (Sections 3.4 and 3.5.1).

3.2. APPROXIMATION OF TWIDDLE FACTORS

The key idea of our algorithm is to approximate a part of the twiddle factors, those with small oscillations, using polynomial functions, reducing the computational complexity of the DFT caused by the mixture of many twiddle factors. Polynomial approximation also allows one to carefully control the degree of the polynomial (or the number of approximating terms), enabling fine-tuning of the output range and the approximation bound of the estimation. Our first goal is to find a collection of twiddle factors with small oscillations. This can be achieved by slightly adjusting the summand of the DFT and splitting the summation as in the Cooley-Tukey algorithm (Section 3.2.1). Next, using a proper base exponential function, we give an explicit form of the approximation to the twiddle factors (Section 3.2.2).

3.2.1. TWIDDLE FACTORS WITH SMALL OSCILLATIONS

Recall that the DFT of a complex-valued vector a of size N is defined as follows:

â_m = Σ_{n∈[N]} a_n e^{-2πimn/N}, (1)

where [ν] denotes {0, 1, ..., ν-1} for a positive integer ν (in this paper, we follow the convention of viewing a vector v = (v_0, v_1, ..., v_{ν-1}) of size ν as a finite sequence defined on [ν]). Assume that N = pq for two integers p, q > 1. The Cooley-Tukey algorithm re-expresses (1) as

â_m = Σ_{k∈[p]} Σ_{l∈[q]} a_{qk+l} e^{-2πim(qk+l)/N} = Σ_{k∈[p]} Σ_{l∈[q]} a_{qk+l} e^{-2πiml/N} · e^{-2πimk/p}, (2)

yielding two collections of twiddle factors, namely {e^{-2πiml/N}}_{l∈[q]} and {e^{-2πimk/p}}_{k∈[p]}. Consider the problem of computing â_m for -M ≤ m ≤ M, where M ≤ N/2 is a non-negative integer. In this case, note that the exponent of e^{-2πiml/N} ranges from -2πiM(q-1)/N to +2πiM(q-1)/N, and that the exponent of e^{-2πimk/p} ranges from -2πiM(p-1)/p to +2πiM(p-1)/p. Here ((q-1)/N) / ((p-1)/p) ∼ 1/p, meaning that the first collection contains twiddle factors with smaller oscillations compared to the second one. Typically, a function with smaller oscillation admits a better polynomial approximation. In this sense, it is reasonable to approximate the first collection of twiddle factors in (2) with polynomial functions, thereby reducing the complexity of the computation due to the mixture of two collections of twiddle factors. Indeed, one can further reduce the complexity of approximation by slightly adjusting the summand in (2) as follows:

â_m = e^{-πim/p} Σ_{k∈[p]} Σ_{l∈[q]} a_{qk+l} e^{-2πim(l-q/2)/N} · e^{-2πimk/p}. (3)

In (3), we observe that the range of exponents of the first collection {e^{-2πim(l-q/2)/N}}_{l∈[q]} of twiddle factors is [-πiM/p, +πiM/p], a contraction by a factor of around 2 compared with [-2πiM(q-1)/N, +2πiM(q-1)/N], hence twiddle factors with even smaller oscillations. There is an extra twiddle factor e^{-πim/p} in (3); note, however, that it depends on neither k nor l, so the amount of additional computation is relatively small.
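As a quick numerical check of the adjusted split (3), the following numpy sketch (the function names are ours, for illustration only) computes a single coefficient both by the naive DFT and by the split form with the extra factor e^{-πim/p}:

```python
import numpy as np

def dft_coefficient(a, m):
    """Naive DFT: the m-th Fourier coefficient of a, as in Eq. (1)."""
    n = np.arange(len(a))
    return np.sum(a * np.exp(-2j * np.pi * m * n / len(a)))

def split_coefficient(a, m, p):
    """The same coefficient via the adjusted Cooley-Tukey split, Eq. (3)."""
    N = len(a)
    q = N // p
    A = a.reshape(p, q)                      # A[k, l] = a_{qk + l}
    k = np.arange(p)[:, None]
    l = np.arange(q)[None, :]
    inner = A * np.exp(-2j * np.pi * m * (l - q / 2) / N)
    outer = np.sum(inner * np.exp(-2j * np.pi * m * k / p))
    return np.exp(-1j * np.pi * m / p) * outer

rng = np.random.default_rng(0)
a = rng.standard_normal(64) + 1j * rng.standard_normal(64)
for m in (-3, 0, 5):
    assert np.allclose(dft_coefficient(a, m), split_coefficient(a, m, p=8))
```

The two expressions agree exactly (up to floating-point error), since the factor e^{-πim/p} cancels the shift by q/2 inside the exponent.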

3.2.2. BASE EXPONENTIAL FUNCTION

The first collection of twiddle factors in (3) consists of q distinct exponential functions. One could apply the approximation process to each function in the collection; however, this would be time-consuming. A more plausible approach is to 1) choose a base exponential function e^{uix} for a fixed u ∈ R, 2) approximate e^{uix} using a polynomial, and 3) exploit a property of exponential functions: the laws of exponents. Specifically, suppose that we obtained a polynomial P(x) that approximates e^{uix} on |x| ≤ |ξ|, where u, ξ ∈ R \ {0}. Consider another exponential function e^{vix}, where v ≠ 0. Since e^{vix} = e^{ui(vx/u)}, the re-scaled polynomial P(vx/u) approximates e^{vix} on |x| ≤ |uξ/v|. This observation indicates that once we find an approximation P to e^{uix} on |x| ≤ |ξ| for properly selected u and ξ, all elements belonging to {e^{-2πim(l-q/2)/N}}_{l∈[q]} can be approximated by re-scaling P. Fixing a base exponential function also enables precomputing a polynomial that approximates it, so that one can bypass the approximation problem at run time. We elaborate this idea rigorously after giving a few definitions (see Definitions 3.1 and 3.2). Let ‖·‖_R be the uniform norm (or supremum norm) restricted to a set R ⊆ R, that is, ‖f‖_R = sup{|f(x)| : x ∈ R}, and let P_α be the set of polynomials on R of degree at most α.

Definition 3.1. Given a non-negative integer α and non-zero real numbers ξ, u, we define a polynomial P_{α,ξ,u} as the best approximation to e^{uix} out of the space P_α under the restriction |x| ≤ |ξ|:

P_{α,ξ,u} := arg min_{P∈P_α} ‖P(x) - e^{uix}‖_{|x|≤|ξ|},

and P_{α,ξ,u} = 1 when ξ = 0 or u = 0.

Smirnov & Smirnov (1999) proved the unique existence of P_{α,ξ,u}. Techniques for computing the polynomial, called minimax approximation algorithms, are reviewed in Fraser (1965).

Definition 3.2. Given a tolerance ε > 0 and a positive integer r ≥ 1, we define ξ(ε, r) to be the scope about the origin such that the exponential function e^{πix} can be approximated by a polynomial of degree less than r with approximation bound ε:

ξ(ε, r) := sup{ξ ≥ 0 : ‖P_{r-1,ξ,π}(x) - e^{πix}‖_{|x|≤ξ} ≤ ε}.

We express the corresponding polynomial as P_{r-1,ξ(ε,r),π}(x) = Σ_{j∈[r]} w_{ε,r-1,j} · x^j.

In Definition 3.2, we choose e^{πix} as the base exponential function. The rationale is as follows. First, using a minimax approximation algorithm, we precompute ξ(ε, r) and {w_{ε,r-1,j}}_{j∈[r]} for several tolerances ε (e.g., 10^-1, 10^-2, ...) and positive integers r (typically 1 ≤ r ≤ 25). When N, M, p, and ε are given, we find the minimum r satisfying ξ(ε, r) ≥ M/p. Then, by the preceding argument, the re-scaled polynomial function P_{r-1,ξ(ε,r),π}(-2x(l-q/2)/N) approximates e^{-2πix(l-q/2)/N} on |x| ≤ |N/(2(l-q/2)) · M/p| for each l ∈ [q] (note that if l - q/2 = 0, we have |N/(2(l-q/2)) · M/p| = ∞). Here |N/(2(l-q/2)) · M/p| = |q/(2l-q) · M| ≥ M for all l ∈ [q]. Therefore, we obtain a polynomial approximation on |m| ≤ M for each twiddle factor in {e^{-2πim(l-q/2)/N}}_{l∈[q]}, namely {P_{r-1,ξ(ε,r),π}(-2m(l-q/2)/N)}_{l∈[q]}. Then, it follows from (3) that

â_m ≈ e^{-πim/p} Σ_{k∈[p]} Σ_{l∈[q]} a_{qk+l} P_{r-1,ξ(ε,r),π}(-2m(l-q/2)/N) · e^{-2πimk/p}, (4)

which gives an estimation of the coefficient â_m for -M ≤ m ≤ M.
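The base-exponential idea and the re-scaling argument can be sketched numerically. In the sketch below, a Chebyshev least-squares fit stands in for the true minimax polynomial P_{r-1,ξ,π} (an assumption for brevity; both are near-optimal on the interval), and all variable names are ours:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev, chebpts1

# Stand-in for the minimax polynomial P_{r-1, xi, pi}: fit e^{pi i t}
# on |t| <= xi by Chebyshev least squares (real and imaginary parts).
r, xi = 14, 1.0                            # degree r-1 = 13, scope |t| <= xi
t = chebpts1(128) * xi                     # Chebyshev sample points on [-xi, xi]
fit_re = Chebyshev.fit(t, np.cos(np.pi * t), r - 1)
fit_im = Chebyshev.fit(t, np.sin(np.pi * t), r - 1)
P = lambda s: fit_re(s) + 1j * fit_im(s)   # P(s) ~= e^{pi i s} on |s| <= xi

# Law of exponents: e^{vix} = e^{pi i (vx/pi)}, so the re-scaled
# polynomial P(v x / pi) approximates e^{vix} on |x| <= |pi xi / v|.
v = 0.3
xs = np.linspace(-np.pi * xi / v, np.pi * xi / v, 1000)
err = np.max(np.abs(P(v * xs / np.pi) - np.exp(1j * v * xs)))
assert err < 1e-6
```

A single precomputed fit of the base function thus serves every exponential e^{vix} in the collection, each on its own (re-scaled) interval.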

3.3. ARBITRARILY CENTERED TARGET RANGES

In the previous section, we focused on the problem of calculating â_m for m belonging to [-M, M]. We now consider a more general case. Let us use the term target range to indicate the range where the Fourier coefficients should be calculated, and R_{µ,M} to denote [µ - M, µ + M] ∩ Z, where µ ∈ Z. Note that the previously given method works only when the target range is centered at µ = 0. A slight modification of the algorithm allows the target range to be arbitrarily centered. One possible approach is as follows: given a complex-valued vector x of size N, we define y as y_n = x_n · e^{-2πiµn/N}. Then, the Fourier coefficients of x and y satisfy the following relationship:

ŷ_m = Σ_{n∈[N]} x_n · e^{-2πiµn/N} · e^{-2πimn/N} = Σ_{n∈[N]} x_n · e^{-2πi(m+µ)n/N} = x̂_{m+µ}.

Thus, the problem of calculating x̂_m for m ∈ R_{µ,M} is equivalent to calculating ŷ_m for m ∈ R_{0,M}, to which our previous method can be applied. This technique, however, requires N extra multiplications due to the computation of y. A better approach, where one can bypass the extra process at run time, is to exploit the following lemma (see Appendix A.1 for the proof).

Lemma 1. Given a non-negative integer α, non-zero real numbers ξ, u, and a real number µ, the following equality holds:

e^{uiµ} · P_{α,ξ,u}(x - µ) = arg min_{P∈P_α} ‖P(x) - e^{uix}‖_{|x-µ|≤|ξ|}.

This observation implies that, in order to obtain a polynomial approximating e^{uix} on |x - µ| ≤ |ξ|, we first find a polynomial P approximating e^{uix} on |x| ≤ |ξ|, then shift P by µ (i.e., form P(x - µ)) and multiply it by the scalar e^{uiµ}. Applying this process to the previously obtained approximation polynomials (see Section 3.2.2) yields {e^{-2πiµ(l-q/2)/N} · P_{r-1,ξ(ε,r),π}(-2(m-µ)(l-q/2)/N)}_{l∈[q]}.
We substitute these polynomials for the twiddle factors {e^{-2πim(l-q/2)/N}}_{l∈[q]} in (3), which gives the following estimation of â_m for m ∈ R_{µ,M}, where k ∈ [p], l ∈ [q], and j ∈ [r]:

e^{-πim/p} Σ_{k,l} a_{qk+l} e^{-2πiµ(l-q/2)/N} · P_{r-1,ξ(ε,r),π}(-2(m-µ)(l-q/2)/N) · e^{-2πimk/p}
= e^{-πim/p} Σ_{k,l} a_{qk+l} e^{-2πiµ(l-q/2)/N} Σ_j w_{ε,r-1,j} (-2(m-µ)(l-q/2)/N)^j · e^{-2πimk/p}
= e^{-πim/p} Σ_j Σ_{k,l} a_{qk+l} e^{-2πiµ(l-q/2)/N} w_{ε,r-1,j} ((m-µ)/p)^j (1-2l/q)^j · e^{-2πimk/p}. (5)
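As a quick check, the modulation identity ŷ_m = x̂_{m+µ} (the baseline approach whose N extra multiplications Lemma 1 lets PFT avoid at run time) can be verified in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu = 128, 10
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Modulating in time shifts the spectrum:
# y_n = x_n e^{-2 pi i mu n / N}  implies  yhat_m = xhat_{m + mu}.
y = x * np.exp(-2j * np.pi * mu * np.arange(N) / N)
xhat, yhat = np.fft.fft(x), np.fft.fft(y)
for m in range(-5, 6):
    assert np.allclose(yhat[m % N], xhat[(m + mu) % N])
```

Indices are taken modulo N since the DFT is N-periodic in the frequency index.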

3.4. EFFICIENT SUMMATIONS

We have found that three main summation steps (over j, k, and l) take place when computing the partial Fourier coefficients. Note that in (5), the innermost summation over j is moved to the outermost position, and the term -2(m-µ)(l-q/2)/N is factorized into two independent terms, (m-µ)/p and 1-2l/q. Interchanging the order of summations and factorizing the term result in a significant computational benefit; we elucidate which operator we should utilize for each summation and how we can save arithmetic costs. As we will see, the innermost sum over l corresponds to a matrix multiplication, the second sum over k can be viewed as multiple DFTs, and the outermost sum over j is an inner product. For the first sum, let A = (a_{k,l}) with a_{k,l} = a_{qk+l}, and B = (b_{l,j}) with b_{l,j} = e^{-2πiµ(l-q/2)/N} w_{ε,r-1,j} (1-2l/q)^j, so that (5) can be written as follows:

e^{-πim/p} Σ_{j∈[r]} ((m-µ)/p)^j Σ_{k∈[p]} e^{-2πimk/p} Σ_{l∈[q]} a_{k,l} b_{l,j}.

Here, note that the matrix B is data-independent (it does not depend on a), and thus can be precomputed. Indeed, we have already seen that {w_{ε,r-1,j}}_{j∈[r]} can be precomputed. The other factors e^{-2πiµ(l-q/2)/N} and (1-2l/q)^j composing the elements of B can also be precomputed if (N, M, µ, p, ε) is known in advance. Thus, as long as the setting (N, M, µ, p, ε) is unchanged, we can reuse the matrix B for any input data a once the configuration phase of PFT is completed (Algorithm 1). Denoting the multiplication A × B as C = (c_{k,j}), the expression becomes

e^{-πim/p} Σ_{j∈[r]} ((m-µ)/p)^j Σ_{k∈[p]} c_{k,j} · e^{-2πimk/p}. (6)

For each j ∈ [r], the summation ĉ_{m,j} = Σ_{k∈[p]} c_{k,j} · e^{-2πimk/p} is a DFT of size p. We perform FFT r times for this computation, which yields the following estimation of â_m for m ∈ R_{µ,M}:

e^{-πim/p} Σ_{j∈[r]} ((m-µ)/p)^j · ĉ_{m,j}. (7)

Note that ĉ_{m,j} is a periodic function of m with period p, so we use the coefficient at m modulo p if m < 0 or m ≥ p.
Thus, the m-th Fourier coefficient of a can be estimated by the inner product of ((m-µ)/p)^j and ĉ_{m,j} with respect to j, followed by a multiplication with the extra twiddle factor e^{-πim/p} (we also precompute ((m-µ)/p)^j and e^{-πim/p}). The full computation is outlined in Algorithm 2. By these summation techniques, the arithmetic complexity is reduced to O(N + M log M) from the naive O(MN), as described in Section 3.5.
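The three summation steps (matrix multiplication, batch FFT, inner product) can be sketched end-to-end in numpy. This is an illustrative re-implementation under our own naming, with the minimax polynomial replaced by an ordinary least-squares fit on Chebyshev points (an assumption for brevity; the paper precomputes true minimax coefficients):

```python
import numpy as np

def pft_configure(N, p, M, mu, r=16):
    """Configuration phase: precompute the q x r matrix B.
    A least-squares fit on Chebyshev points stands in for the minimax
    polynomial; w[j] plays the role of w_{eps, r-1, j}."""
    q = N // p
    xi = M / p                    # the polynomial must be accurate on |x| <= M/p
    xs = np.cos((np.arange(4 * r) + 0.5) * np.pi / (4 * r)) * xi
    w = (np.polyfit(xs, np.cos(np.pi * xs), r - 1)[::-1]
         + 1j * np.polyfit(xs, np.sin(np.pi * xs), r - 1)[::-1])
    l, j = np.arange(q)[:, None], np.arange(r)[None, :]
    B = np.exp(-2j * np.pi * mu * (l - q / 2) / N) * w * (1 - 2 * l / q) ** j
    return B, q, r

def pft_compute(a, B, p, q, r, M, mu):
    """Computation phase: sum over l, then k, then j, as in (5)-(7)."""
    C = a.reshape(p, q) @ B                  # sum over l: matrix product
    Chat = np.fft.fft(C, axis=0)             # sum over k: r FFTs of size p
    m = np.arange(mu - M, mu + M + 1)
    powers = ((m - mu) / p)[:, None] ** np.arange(r)[None, :]
    return np.exp(-1j * np.pi * m / p) * np.sum(powers * Chat[m % p], axis=1)

# Sanity check against the full FFT on a small example.
rng = np.random.default_rng(0)
N, p, M, mu = 1024, 32, 16, 100
a = rng.standard_normal(N) + 1j * rng.standard_normal(N)
B, q, r = pft_configure(N, p, M, mu)
est = pft_compute(a, B, p, q, r, M, mu)
true = np.fft.fft(a)[np.arange(mu - M, mu + M + 1) % N]
assert np.linalg.norm(est - true) / np.linalg.norm(true) < 1e-5
```

Note how the data-dependent work is only a p×q by q×r product, r FFTs of size p, and 2M+1 short inner products, mirroring the cost analysis of Section 3.5.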

3.5. THEORETICAL ANALYSIS

We present theoretical analysis on the time complexity of PFT and its approximation bound.

Algorithm 1: Configuration phase of PFT
input: Input size N, output descriptors M and µ, divisor p, and tolerance ε
output: Matrix B, divisor p, and numbers of rows and columns, q and r
1: q ← N/p
2: r ← min{r ∈ N : ξ(ε, r) ≥ M/p}   // use precomputed ξ(ε, r)
3: for (l, j) ∈ [q] × [r] do
4:     B[l, j] ← e^{-2πiµ(l-q/2)/N} · w_{ε,r-1,j} · (1-2l/q)^j   // use precomputed w_{ε,r-1,j}
5: end

Algorithm 2: Computation phase of PFT
input: Vector a of size N, output descriptors M and µ, and configuration results B, p, q, r
output: Vector E(â) of estimated Fourier coefficients of a for [µ - M, µ + M]
1: A[k, l] ← a_{qk+l} for k ∈ [p] and l ∈ [q]
2: C ← A × B
3: for j ∈ [r] do
4:     Ĉ[·, j] ← FFT(C[·, j])   // FFT of the j-th column of C
5: end
6: for m ∈ [µ - M, µ + M] do
7:     E(â)[m] ← e^{-πim/p} Σ_{j∈[r]} ((m-µ)/p)^j · Ĉ[m mod p, j]
8: end

3.5.1. TIME COMPLEXITY

We analyze the time complexity of PFT. Theorem 2 (see Appendix A.2 for the proof) shows that the time cost T(N, M) of PFT, where N and M are input and output size descriptors, respectively, is bounded by O(N + M log M). Note that the theorem presumes that all prime factors of N have a fixed upper bound. In practice, this requirement is not a significant concern because one can readily control the input size with basic techniques such as zero-padding or re-sampling. Moreover, we empirically find that even if N has a large prime factor, PFT still shows a promising performance (see Section 4.2). In Theorem 2, recall that a positive integer is called b-smooth if none of its prime factors is greater than b. For example, the 2-smooth integers are exactly the powers of 2.

Theorem 2. Fix a tolerance ε > 0 and an integer b ≥ 2. If N is b-smooth, then the time complexity T(N, M) of PFT has an asymptotic upper bound O(N + M log M).
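For illustration, b-smoothness is cheap to test; a small helper (ours, not from the paper) checks it by trial division, using the input sizes that appear in Section 4:

```python
def is_b_smooth(n, b):
    """Return True if no prime factor of n exceeds b."""
    for d in range(2, b + 1):
        while n % d == 0:
            n //= d
    return n == 1

assert is_b_smooth(2**10, 2)          # powers of 2 are 2-smooth
assert is_b_smooth(32000, 5)          # 32000 = 2^8 * 5^3 (Urban Sound)
assert not is_b_smooth(19735, 5)      # 19735 = 5 * 3947 (Air Condition)
```

The last example is the pathological case discussed in Section 4.2, where N has the large prime factor 3947.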

3.5.2. APPROXIMATION BOUND

We now give a theoretical approximation bound of the estimation via the polynomial P. We denote the estimated Fourier coefficients of a by E(â). Theorem 3 (see Appendix A.3 for the proof) states that the approximation bound over the target range depends on the total weight ‖a‖_1 of the original vector and the given tolerance ε, where ‖·‖_1 denotes the ℓ1 norm.

Theorem 3. Given a tolerance ε > 0, the following inequality holds for PFT:

‖â - E(â)‖_{R_{µ,M}} ≤ ‖a‖_1 · ε.

4. EXPERIMENTS

Through experiments, we aim to answer the following questions:
• Q1. Run-time cost (Section 4.2). How quickly does PFT compute a part of Fourier coefficients compared to other competitors without sacrificing accuracy?
• Q2. Effect of hyper-parameter p (Section 4.3). How do different choices of the divisor p of input size N affect the overall performance of PFT?
• Q3. Effect of different precision (Section 4.4). How do different precision settings affect the run time of PFT?
• Q4. Anomaly detection (Section 4.5). How well does PFT work for a practical application employing FFT (anomaly detection)?

4.1. EXPERIMENTAL SETUP

Machine. A machine with an Intel Core i7-6700HQ @ 2.60GHz and 8GB of RAM is used.
Datasets. We use both synthetic and real-world datasets listed in Table 1: Urban Sound (real-world, size 32000), containing various sound recordings in an urban environment, and Air Condition (real-world, size 19735), containing time-series vectors of air condition information.
Competitors. We compare PFT with two state-of-the-art FFT algorithms, FFTW and MKL, as well as Pruned FFTW. All of them are implemented in C++.
1. FFTW: FFTW is one of the fastest public implementations of FFT, offering hardware-specific optimization. We use the optimized version of FFTW 3.3.5, and do not include the pre-processing for the optimization in the run-time cost.
2. MKL: Intel Math Kernel Library (MKL) is a library of optimized math routines including FFT, and often shows a better run-time result than FFTW. All the experiments are conducted with an Intel processor for the best performance.
3. pFFT: pFFT is a pruned version of FFTW designed for fast computation of a subset of the outputs. The algorithm uses the optimized FFTW as a subroutine.
4. PFT (proposed): we use MKL BLAS routines for the matrix multiplication, MKL DFTI functions for the batch FFT computation, and the Intel Integrated Performance Primitives (IPP) library for the post-processing steps such as inner product and element-wise multiplication.
Measure.
In all experiments, we use the single-precision floating-point format, and the parameters p and ε are chosen so that the relative ℓ2 error is strictly less than 10^-6, which ensures that the overall estimated coefficients have at least 6 significant figures. Explicitly,

Relative ℓ2 Error = ( Σ_{m∈R} |â_m - E(â)_m|^2 / Σ_{m∈R} |â_m|^2 )^{1/2} < 10^-6,

where â is the actual coefficient vector, E(â) is the estimation of â, and R is the target range. Section 4.4 is an exception, where we investigate different settings, relaxing the precision to 10^-4 or 10^-2.
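The error measure above is a one-liner in numpy; the helper name below is ours:

```python
import numpy as np

def relative_l2_error(true, est):
    """Relative l2 error over the target range, as defined in Section 4.1."""
    return np.linalg.norm(est - true) / np.linalg.norm(true)

true = np.array([1 + 1j, 2.0, -3j])
assert relative_l2_error(true, true) == 0.0
assert np.isclose(relative_l2_error(true, true * (1 + 1e-7)), 1e-7)
```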

4.2. RUN-TIME COST

Run time vs. input size. We fix the target range to R_{0,2^9} and evaluate the run time of PFT for input sizes N = 2^12, 2^13, ..., 2^22. Figure 1a shows how the four competing algorithms scale with varying input size, wherein PFT outperforms the others if the output size is sufficiently small (< 10%) compared to the input size. Consequently, PFT achieves up to 19× speedup compared to its competitors. Due to the overhead of the O(N) pre- and O(M) post-processing steps, PFT runs slower than FFT when M is close to N, in which case the time complexity tends to O(N + N log N).
Run time vs. output size. We fix the input size to N = 2^22 and evaluate the run time of PFT for target ranges R_{0,2^9}, R_{0,2^10}, ..., R_{0,2^18}. The result is illustrated as a run time vs. output size plot (recall that |R_{0,M}| ≈ 2M) in Figure 1b. Note that the run times of FFTW and MKL do not benefit from the information of the output size. We also observe that the pruned FFT (pFFT) shows only a modest improvement compared to the full FFTs, while PFT significantly extends the range of output sizes for which partial Fourier transform becomes practical.
Real-world data. For real-world data, the size of an input vector is generally not a power of 2. Notably, PFT still shows a promising performance even when the input size is not a power of 2 or has a large prime factor: a strong indication that our proposed technique is robust for many different real-world applications. The Urban Sound dataset contains various sound recording vectors of size N = 32000 = 2^8 × 5^3. We evaluate the run time of PFT for output sizes ranging from 100 to 6400. Figure 2a illustrates the result, wherein PFT outperforms the competitors if the output size is small enough compared to the input size. On the other hand, the Air Condition dataset is composed of time-series vectors of size N = 19735 = 5 × 3947.
Note that N has only two non-trivial divisors, namely 5 and 3947, forcing one to choose p = 3947 in any practical setting; if we choose p = 5, the ratio M/p often turns out to be too large, which results in a poor performance (see Section 4.3 for more discussion of the optimal choice of p). The run time of PFT vs. output size ranging from 125 to 16000 is evaluated in Figure 2b (pFFT is not included in the figure since it consistently runs slower than FFTW). It is noteworthy that PFT still outperforms its competitors even in such pathological examples, implying the robustness of our algorithm for various real-world situations.

Under review as a conference paper at ICLR 2021

[Figure 1: run time of PFT, pFFT, FFTW, and MKL vs. (a) input size (target range R_{0,2^9}, input sizes 2^12 to 2^22) and (b) output size for S_22 (N = 2^22). PFT runs faster than the competitors if the output size is small enough (< 10%) compared to the input size, and consistently outperforms the pruned FFT.]

[Figure 2: run time vs. output size on the (a) Urban Sound (N = 32000, output sizes 100 to 6400) and (b) Air Condition (N = 19735) datasets.]

4.3. EFFECT OF HYPER-PARAMETER p

To investigate the effect of different choices of p, we fix N = 2^22 and vary the ratio M/p from 1/32 to 4 for target ranges R_{0,2^9}, R_{0,2^10}, ..., R_{0,2^18}. Table 2 shows the resulting run times, where bold highlights the best choice of M/p for each M, and the missing entries are due to worse performance than FFT. One crucial observation is as follows: as the output size increases, the best choice of M/p also increases or, equivalently, the optimal value of p tends to remain stable. Intuitively, this is the consequence of "balancing" the three summation steps (Section 3.4): when M ≪ N, the most computationally expensive operation is the matrix multiplication with O(rN) time cost, and thus M/p should be small so that r decreases, despite a sacrifice in the batch FFT step requiring O(rp log p) operations (Appendix A.2). As M becomes larger, however, more attention must be paid to the batch FFT and post-processing steps, so the parameter p should not change rapidly. This observation indicates the possibility that the optimal value of p can be algorithmically auto-selected given a setting (N, M, µ, ε), which we leave as future work.

4.4. EFFECT OF DIFFERENT PRECISION

We investigate the trade-off between accuracy and running time of PFT. To do this, we fix N = 2^22 and change the precision goal from 10^-6 to 10^-4 or 10^-2 for target ranges R_{0,2^9}, R_{0,2^10}, ..., R_{0,2^18}. Table 3 shows the results, where the parenthesized number is the ratio of the run time of each setting to that of 10^-6. Note that the run time of PFT is reduced by up to 17% or 27% when the precision goal is set to 10^-4 or 10^-2, respectively. This observation indicates that one may readily benefit from the trade-off, especially when fast evaluation is of utmost importance, albeit with a slight sacrifice in accuracy.

4.5. ANOMALY DETECTION

We demonstrate an example of how PFT is applied in practice via one simple but fundamental principle: replace the "perform FFT and discard unused coefficients" procedure with "just perform PFT". Consider the anomaly detection method proposed in Rasheed et al. (2009), where one first performs FFT and then an inverse FFT with only a few low-frequency coefficients to obtain an estimated fitted curve; we can directly apply the principle to this method. To verify this experimentally, we use a time-series vector from the Air Condition dataset, and set the target range to R_{0,125} (about 250 low-frequency coefficients). Note that, in this setting, PFT results in around 8× speedup compared to the conventional FFT (see Figure 2b). The top-20 anomalous points detected from the data are presented in Figure 3. In particular, we found that replacing FFT with PFT does not change the result of top-20 anomaly detection, while retaining all its computational benefits.
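The "fit with low-frequency coefficients, flag large residuals" principle can be sketched as follows. For brevity the sketch uses a full FFT where PFT would be substituted, and the function names are ours, not from Rasheed et al. (2009):

```python
import numpy as np

def lowpass_fit(x, M):
    """Fitted curve from the 2M+1 low-frequency coefficients
    (the part of the spectrum that PFT would compute directly)."""
    N = len(x)
    xhat = np.fft.fft(x)
    keep = np.zeros(N, dtype=complex)
    idx = np.r_[0:M + 1, N - M:N]        # frequencies in the range R_{0,M}
    keep[idx] = xhat[idx]
    return np.fft.ifft(keep).real

def top_anomalies(x, M, k):
    """Indices of the k points deviating most from the fitted curve."""
    resid = np.abs(x - lowpass_fit(x, M))
    return np.argsort(resid)[-k:][::-1]

# Smooth signal plus an injected spike: the spike is flagged.
t = np.arange(256)
x = np.sin(2 * np.pi * 3 * t / 256)
x[100] += 5.0
assert top_anomalies(x, M=8, k=1)[0] == 100
```

Swapping the FFT call for a partial transform changes only how `xhat[idx]` is obtained, not the detection logic, which is why the top-20 results are unaffected.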

5. CONCLUSIONS

In this paper, we propose PFT (fast Partial Fourier Transform), an efficient algorithm for computing a part of Fourier coefficients. PFT approximates twiddle factors with relatively small oscillations using polynomials, reducing the computational complexity of DFT due to the mixture of many twiddle factors. Experimental results show that PFT outperforms the state-of-the-art FFTs as well as pruned FFT, with an order of magnitude of speedup without loss of accuracy, significantly extending the range of applications where partial Fourier transform becomes practical. Future work includes optimizing the implementation of PFT; for example, the optimal divisor p of the input size N could be algorithmically auto-selected. We also believe that hardware-specific optimizations would further increase the performance of PFT.

Exploiting the above property, we reduce the time complexity of PFT to a functional form dependent only on N and M. We follow the convention in counting FFT operations, assuming that all data-independent elements such as the configuration results B, p, q, r and the twiddle factors are precomputed and thus not included in the run-time cost. We begin with the construction of the matrix A. For this, we merely interpret a as an array representation of A of size p × q, where pq = N (line 1 in Algorithm 2). Also, recall that the matrix B can be precomputed as described in Section 3.4. For the two matrices A of size p × q and B of size q × r, the standard matrix multiplication algorithm has a running time of O(pqr) = O(r · N) (line 2 in Algorithm 2). Next, the expression (6) contains r DFTs of size p, namely ĉ_{m,j} = Σ_{k∈[p]} c_{k,j} · e^{-2πimk/p} for each j ∈ [r]. We apply FFT r times for this computation; it is then easy to see that the time cost is O(r · p log p) (lines 3-5 in Algorithm 2). Finally, there are 2M + 1 coefficients to be calculated in (7), each requiring O(r) operations, giving an upper bound of O(r · M) for the running time (lines 6-8 in Algorithm 2).
Combining the three upper bounds, we formally express the time complexity T(N, M) of PFT as T(N, M) = O(r · (N + p log p + M)). Note that r depends only on ε and M/p by its definition (Algorithm 1). Therefore, when ε is fixed, T(N, M) depends on the choice of p. By the preceding argument, we can always find a divisor p of N such that M/√b ≤ p < √b · M, implying that M/p is tightly bounded, and thus so is r. It follows that p = Θ(M) and r = Θ(1), which leads to the following asymptotic upper bound for T(N, M) with respect to N and M: T(N, M) = O(N + M log M), hence the proof.
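The four counted steps can be made concrete with a structural sketch of Algorithm 2. This is not the full method: the real matrix B holds the polynomial-approximation coefficients fixed by Algorithm 1, and the true per-output weights come from that approximation; here B is random and the combination uses uniform weights, purely to exhibit the data flow and the O(rN + rp log p + rM) cost shape.

```python
import numpy as np

def pft_skeleton(a, p, B, M, mu=0):
    """Structural sketch of Algorithm 2 (illustrative, not the real PFT).

    Steps mirror the operation count in the proof:
      1. reshape a into a p x q matrix A          -- O(N)
      2. C = A @ B with B of size q x r           -- O(rN)
      3. r FFTs of size p along the first axis    -- O(r p log p)
      4. combine r terms for each of 2M+1 outputs -- O(rM)
    """
    N = len(a)
    q = N // p
    A = a.reshape(p, q)                  # step 1
    C = A @ B                            # step 2: p x r
    Chat = np.fft.fft(C, axis=0)         # step 3: r size-p FFTs
    out = np.empty(2 * M + 1, dtype=complex)
    for i, m in enumerate(range(mu - M, mu + M + 1)):
        # step 4: each output mixes the r columns at row m mod p
        # (uniform weights here; PFT uses polynomial coefficients)
        out[i] = Chat[m % p, :].sum()
    return out

N, p, rank, M = 1 << 12, 1 << 6, 4, 10
a = np.random.default_rng(1).standard_normal(N)
B = np.random.default_rng(2).standard_normal((N // p, rank))
y = pft_skeleton(a, p, B, M)
```

Note how the dominant term moves between step 2 and steps 3-4 as M/p varies, which is exactly the balancing argument used to select p.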



https://urbansounddataset.weebly.com/urbansound8k.html
https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction
http://www.fftw.org/index.html
http://software.intel.com/mkl
http://www.fftw.org/pruned.html



Figure 1: (a) Run time vs. input size for target range R_{0,2^9} with {S_n}_{n=12}^{22} datasets, and (b) run time vs. output size for S_22. PFT runs faster than the competitors if the output size is small enough (< 10%) compared to the input size. Note that PFT consistently outperforms the pruned FFT.

Figure 2: Run time vs. output size results for (a) Urban Sound dataset and (b) Air Condition dataset. PFT outperforms the competitors regardless of the fact that the input size is not a power of 2 (N = 2^8 × 5^3) or even has a large prime factor (N = 5 × 3947).

Figure 3: Top-20 anomalous points detected in Air Condition time-series data, where each red dot denotes a detected anomaly position. Note that replacing FFT with PFT does not change the result of the detection, while reducing the overall computation time.

Proof. Let Q = argmin_{P∈P_α} ‖P(x) − e^{uix}‖_{|x−µ|≤|ξ|}. We first observe the following:

Q = argmin_{P∈P_α} ‖P(x) − e^{uix}‖_{|x−µ|≤|ξ|}
  = argmin_{P∈P_α} ‖P(x + µ) − e^{ui(x+µ)}‖_{|x|≤|ξ|}
  = argmin_{P∈P_α} ‖e^{−uiµ} · P(x + µ) − e^{uix}‖_{|x|≤|ξ|},

where the third equality holds since |e^{−uiµ}| = 1. Recall that the polynomial P_{α,ξ,u} is defined by argmin_{P∈P_α} ‖P(x) − e^{uix}‖_{|x|≤|ξ|}. If Q(x) ∈ P_α, it is clear that e^{−uiµ} · Q(x + µ) ∈ P_α because translation and non-zero scalar multiplication of a polynomial do not change its degree. Therefore, by the uniqueness of the best approximation (Smirnov & Smirnov, 1999), we have e^{−uiµ} · Q(x + µ) = P_{α,ξ,u}(x), which yields Q(x) = e^{uiµ} · P_{α,ξ,u}(x − µ), and hence the proof.

A.2 PROOF OF THEOREM 2

Proof. We first claim that the following statement holds: let b ≥ 2; if N is b-smooth and M ≤ N is a positive integer, then there exists a positive divisor p of N satisfying M/√b ≤ p < √b · M. Indeed, suppose that none of N's divisors belongs to [M/√b, √b · M). Let 1 = p_1 < p_2 < ... < p_d = N be the enumeration of all positive divisors of N in increasing order. It is clear that p_1 < √b · M and M/√b < p_d since b ≥ 2 and 1 ≤ M ≤ N. Then, there exists an i ∈ {1, 2, ..., d − 1} such that p_i < M/√b and p_{i+1} ≥ √b · M. Since N is b-smooth and p_i < N, at least one of 2p_i, 3p_i, ..., bp_i must be a divisor of N. However, this is a contradiction: we have p_{i+1}/p_i > (√b · M)(M/√b)^{−1} = b, so every element of {2p_i, 3p_i, ..., bp_i} lies strictly between the consecutive divisors p_i and p_{i+1}, and hence none of them can be a divisor of N, which completes the proof of the claim.
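The divisor guarantee in the claim above is easy to check empirically. The sketch below searches for a divisor of N in [M/√b, √b·M) by brute force; the function name is ours, not from the paper.

```python
import math

def find_balanced_divisor(N, M, b):
    """For b-smooth N and 1 <= M <= N, the claim guarantees a divisor p
    of N with M/sqrt(b) <= p < sqrt(b)*M; search for one directly."""
    lo, hi = M / math.sqrt(b), math.sqrt(b) * M
    for p in range(1, N + 1):
        if N % p == 0 and lo <= p < hi:
            return p
    return None  # never reached when N is b-smooth and M <= N

# N = 2^22 is 2-smooth, so a divisor in [M/sqrt(2), sqrt(2)*M) always exists.
p1 = find_balanced_divisor(2**22, 2**10, 2)
p2 = find_balanced_divisor(2**20, 3000, 2)
```

Such a p satisfies p = Θ(M), which is what keeps M/p (and hence the rank r) bounded in the complexity analysis.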

Detailed information of datasets.

Table 2: Average run time (ms) of PFT for N = 2^22 with different settings of M/p and M.

Table 3: Average run time (ms) of PFT for N = 2^22 with different precision settings.

A.3 PROOF OF THEOREM 3

Proof. Let v = −2(l − q/2)/N. By the estimation in (5), the error of each term is bounded by

|a_{qk+l}| · |e^{πivm} − e^{πivµ} · P_{r−1, ξ(ε,r), π}(v(m − µ))| · |e^{−2πimk/p}| · |e^{−πim/p}|.

As l ranges from 0 to q − 1, we have |v| ≤ 2(q/2)/N = 1/p, and thus M|v| ≤ M/p ≤ ξ(ε, r). We extend the domain of the function e^{πiv(m−µ)} (note that extending the domain never decreases the uniform norm) and replace v(x − µ) with x, from which it follows that the approximation error is bounded by ε, where the second inequality holds since M|v| ≤ ξ(ε, r), hence the desired result.

