CORESET FOR RATIONAL FUNCTIONS

Abstract

We consider the problem of fitting a rational function g : R → R to a time-series f : {1, …, n} → R. This is by minimizing the sum of distances (loss function) ℓ(g) := Σ_{i=1}^n |f(i) − g(i)|, possibly with additional constraints and regularization terms that may depend on g. Our main motivation is to approximate such a time-series by a recursive sequence model G_n = Σ_{i=1}^k θ_i G_{n−i}, e.g. a Fibonacci sequence, where θ ∈ R^k is the vector of model parameters, and k ≥ 1 is constant. For ε ∈ (0, 1), an ε-coreset for this problem is a data structure that approximates ℓ(g) up to a 1 ± ε multiplicative factor, for every rational function g of constant degree. We suggest a coreset construction that runs in O(n^{1+o(1)}) time and returns such a coreset that uses O(n^{o(1)}/ε²) memory words. We provide open source code as well as extensive experimental results, on both real and synthetic datasets, which compare our method to existing solvers from Numpy and Scipy.

1. BACKGROUND

The original motivation for this work was to suggest provable and efficient approximation algorithms for fitting input data by a stochastic model or its variants, such as Hidden Markov Models (HMMs) Basu et al. (2001); McCallum et al. (2000); Murphy (2002); Park et al. (2012); Sassi et al. (2020); Yu et al. (2010) and references therein, Bayesian networks Acar et al. (2007); Murphy (2002); Nikolova et al. (2010); Rudzicz (2010) and references therein, auto-regression Ghosh et al. (2013), and Markov Decision Processes Shanahan & den Poel (2010). Informally, and in the context of this work, a model defines a time-series (sequence, discrete signal) F : [n] → R, where [n] := {1, …, n}, such that the value F(t) at time (integer) t ≥ 1 is a function of only the previous (constant) k ≥ 1 past values F(t−1), …, F(t−k) in the sequence and the model's parameters θ.

1.1. AUTO-REGRESSION

Unfortunately, most existing results seem to be based on heuristics with few provable approximation guarantees. We thus investigate a simplified but fundamental version, called auto-regression, which has a provable but not so efficient solution using polynomial system solvers, after applying the technique of generating functions. This technique is strongly related to the Fourier, Laplace and z-transforms, as explained below. We define an auto-regression, inspired by Ghosh et al. (2013); Eshragh et al. (2019); Yuan (2009), as follows:

Definition 1. A time-series F : [n] → R is an auto-regression (AR for short) of degree k if there exists a vector of coefficients θ = (θ_1, …, θ_k) ∈ R^k such that F(t) = θ_1 F(t−1) + … + θ_k F(t−k). The polynomial P(x) = x^k − θ_k x^{k−1} − … − θ_1 is called the characteristic polynomial of F.

Substituting k = 2, θ = (1, 1) and F(1) = F(2) = 1 in Definition 1 yields the Fibonacci sequence, i.e., F(t) = F(t−1) + F(t−2), where F(1) = F(2) = 1.

From auto-regression to rational functions. In the corresponding "data science version" of Fibonacci's sequence, the input is the time-series G(1), G(2), …, which is based on F with additional noise. A straightforward method to recover the original model is to directly minimize the squared error between the given noisy time-series and the fitted values, as done e.g. in Eshragh et al. (2019) using simple linear regression. However, this has a major drawback: an AR time-series usually grows exponentially, like a geometric sequence, and thus the loss will be dominated by the last few terms of the time-series. Moreover, since small changes in the time domain have an exponential effect over time, it makes more sense to assume that the noise is added in the frequency or generating-function domain. Intuitively, noise in analog signals, such as audio/video signals from an analog radio/TV, is added in the frequency domain, e.g. aliasing of channels Nyquist (1928); Karras et al. (2021); Shani & Brafman (2004), and not in the time domain, e.g. the volume. To address these issues, the fitting is done on the corresponding generating functions, as follows.

Proposition 1 (generating function Yuan (2009)). Consider an AR time-series F and its characteristic polynomial P(x) of degree k. Let Q(x) = x^k P(1/x) be the polynomial whose coefficients are the coefficients of P in reverse order. Then there is a polynomial R(x) of degree less than k such that the generating function of F is f(x) := Σ_{i=1}^∞ F(i) x^{i−1} = R(x)/Q(x), for every x ∈ R where the series converges.

Inspired by the motivation above, we define the following loss function for the AR recovery problem.

Problem 1 (RFF). Given a time-series g : [n] → R and an integer k ≥ 1, find a rational function f : [n] → R whose numerator and denominator are polynomials of degree at most k that minimizes Σ_{x=1}^n |f(x) − g(x)|.

Note that the loss above is for fitting samples from the generating function of a noisy AR, as done in Section 3.1. While we focus on sum of errors (distances), we expect easy generalization to squared distances, robust M-estimators and any other loss function that satisfies the triangle inequality up to a constant factor, as in other coreset constructions Feldman (2020).
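As a quick sanity check of Proposition 1, consider the Fibonacci case: here P(x) = x² − x − 1, Q(x) = x²P(1/x) = 1 − x − x², and R(x) = 1, so f(x) = 1/(1 − x − x²) wherever the series converges (|x| below the reciprocal of the golden ratio, about 0.618). A minimal sketch (the function names are ours):

```python
# Sketch: for the Fibonacci sequence (k = 2, theta = (1, 1)), Proposition 1
# gives Q(x) = 1 - x - x^2 and R(x) = 1, so the generating function is
# f(x) = 1 / (1 - x - x^2) on the interval of convergence.

def fibonacci(n):
    """First n Fibonacci values F(1), ..., F(n) with F(1) = F(2) = 1."""
    F = [1, 1]
    while len(F) < n:
        F.append(F[-1] + F[-2])
    return F[:n]

def partial_generating_function(x, n=60):
    """Partial sum of sum_{i=1}^{n} F(i) * x^(i-1)."""
    return sum(f * x ** i for i, f in enumerate(fibonacci(n)))

x = 0.3  # inside the radius of convergence
closed_form = 1.0 / (1.0 - x - x * x)  # R(x) / Q(x)
assert abs(partial_generating_function(x) - closed_form) < 1e-9
```

The partial sums converge geometrically to R(x)/Q(x) for any |x| below the radius of convergence, which is the sense in which the noisy samples of Section 3.1 approximate a rational function.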

1.2. CORESETS

Informally, given an input signal P consisting of 2-dimensional points, a set Q of models, an approximation error ϵ ∈ (0, 1), and a loss function ℓ, a coreset C is a data structure that approximates the loss ℓ(P, q), for every model q ∈ Q, up to a multiplicative factor of 1 ± ϵ, in time that depends only on |C|. Hence, ideally, C is also much smaller than the original input P. Coreset for rational functions. Unfortunately, similarly to Rosman et al. (2014), the RFF problem with general input has no coreset which is a weighted subset of the input; see Claim 1. This was also the case, e.g., in Jubran et al. (2021). Hence, we solve this problem similarly to Rosman et al. (2014), which requires us to assume that the first coordinates of our input signal are simply n consecutive integers, rather than a general set of reals. Even under this assumption there is no coreset which is a weighted subset of the input, or even a small weighted set of points; see Claim 1. We solve this problem by constructing a "representational" coreset that allows efficient storage and evaluation, but does not immediately yield an efficient solution to the problem, as is more common Feldman (2020). For more explanation on the components of this coreset see Section 1.4. Why such a coreset? A trivial use of such a coreset is data compression for efficient transmission and storage. While coresets have many useful properties, as mentioned in Feldman (2020), some of them do not follow immediately from our coreset; see Feldman (2020) for a general overview, which we skip due to space limitations. Nonetheless, since optimization over the coreset reduces the number of parameters, we hope that in the future there will be an efficient guaranteed solution (or approximation) over the coreset. Moreover, since this coreset does support efficient evaluation, we hope it will improve heuristics by utilizing this fast evaluation.

1.3. RELATED WORK

Polynomial approximations. While polynomials are usually simple and easy to handle, they do not suffice to accurately approximate non-smooth or non-Lipschitz functions; in such cases, high-degree polynomials are required, which leads to severe oscillations and numerical instabilities Peiris et al. (2021). To overcome this problem, one might try to utilize piecewise polynomials or polynomial splines Northrop (2016); Nürnberger (1989); Sukhorukova (2010). However, this results in a very complex optimization problem Meinardus et al. (1989); Sukhorukova & Ugon (2017).

Rational function approximation.

A more promising direction is to utilize rational functions for approximating the input signal; an example motivation is Runge's phenomenon Epperson (1987), which resonates with Figure 4 in the appendix. Rational function approximation is a straightforward extension of polynomial approximation Trefethen (2019), yet is much more expressive due to the polynomial in the denominator. A motivation for this can be found in the popular book Bulirsch et al. (2002). A close relation between such functions and spline approximations has been demonstrated, e.g., in Petrushev & Popov (2011). Given an input signal consisting of n pairs of points in R², the rational function fitting (or RFF) problem aims to recover a rational function that best fits this input, so as to minimize some given loss function. Hardness of rational function approximation. To the best of our knowledge, rational function approximation has only been solved for the max-deviation case (where the loss is the maximum over all the pairwise distances between the input signal and the approximation function) in Peiris et al. (2021). Various heuristics have been suggested for other instances of the problem; see Peiris et al. (2021) and references therein.
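To make Runge's phenomenon concrete, the sketch below interpolates the classic example f(x) = 1/(1 + 25x²), itself a rational function, with a single degree-10 polynomial on equispaced nodes. This is our own illustration (helper names are ours), not the paper's Figure 4:

```python
# Sketch of Runge's phenomenon: interpolating f(x) = 1 / (1 + 25 x^2) on
# equispaced nodes in [-1, 1] with a single high-degree polynomial produces
# large oscillations near the endpoints, even though the target itself is a
# simple rational function. Pure-Python Lagrange interpolation; no libraries.

def lagrange_eval(nodes, values, x):
    """Evaluate the interpolating polynomial through (nodes, values) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(nodes, values)):
        term = yj
        for m, xm in enumerate(nodes):
            if m != j:
                term *= (x - xm) / (xj - xm)
        total += term
    return total

runge = lambda x: 1.0 / (1.0 + 25.0 * x * x)
n = 11  # 11 nodes -> degree-10 interpolation
nodes = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
values = [runge(x) for x in nodes]

grid = [-1.0 + 2.0 * i / 400 for i in range(401)]
max_err = max(abs(lagrange_eval(nodes, values, x) - runge(x)) for x in grid)
# The worst-case error near the endpoints exceeds the function's own
# maximum value of 1 on the whole interval.
assert max_err > 1.0
```

Raising the polynomial degree only worsens the endpoint oscillations on equispaced nodes, which is exactly the failure mode that motivates a polynomial in the denominator.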

1.4. NOVELTY

The suggested coreset in this paper is very different from those in previous coreset papers. To our knowledge, this is the first coreset for stochastic signals. Most existing coresets are motivated by problems in computational geometry or linear algebra, especially clustering and subspace approximation. Their input is usually a set of points (and not a time series), with few exceptions, e.g., coresets for linear regression Dasgupta et al. (2009), which can be considered a type of hyperplane approximation. Our main challenges were also very different from those of existing coreset constructions. A typical coreset construction begins with what is known as an α-approximation or (α, β)-approximation of the optimal solution that can be easily constructed, e.g. using Arthur & Vassilvitskii (2007); Feldman & Langberg (2011); Cohen et al. (2015). From this point, the main step in these papers is to compute the sensitivity or importance of each point using the Feldman-Langberg framework Feldman & Langberg (2011). The coreset is then a non-uniform random sample from the input set, based on the distribution of sensitivities. In this paper, however, the sensitivity of a point is simply proportional to its distance from our (α, β)-approximation. Therefore, the main challenge in this paper is to use our solver for rational functions, which takes time polynomial in n, to compute an efficient (α, β)-approximation. We cannot use the existing sampling techniques as in Feldman & Langberg (2011), since they might create too many "holes" (non-consecutive sub-signals) in the input signal. A solution for computing a bicriteria approximation for k linear segments was suggested in Feldman et al. (2012); Jubran et al. (2021), by partitioning the input signal into 2k equal consecutive sub-signals, so that at least half of them contain only a single segment, for every k-segmentation. In our case, a degree-k rational function cannot be partitioned into, say, O(k) linear or even polynomial functions.
Instead, we compute a "weak coreset", which guarantees the desired approximation for a constrained version of the problem, in terms of constraints on both its input signal and the feasible rational functions. We then combine this weaker coreset with a merge-reduce tree for maintaining an (α, β)-approximation, which is very different from the classic merge-reduce tree that is used to construct coresets for streaming data. This tree, of height at most log log n, is also related to the not-so-common running time and space complexity of our suggested coreset. We hope that this coreset technique will be generalized in the future to more involved stochastic models, such as graphical models, Bayesian networks, and HMMs.

2.1. DEFINITIONS

We assume that k ≥ 1 is an integer, and denote by R^k the set of k-dimensional real column vectors. For every integer n ≥ 0, we denote [n] = {1, …, n}. For every vector c = (c_1, …, c_k) ∈ R^k and a real number x ∈ R, we denote by poly(c, x) = Σ_{i=1}^k c_i · x^{i−1} the value at x of the polynomial of degree less than k whose coefficients are the entries of c. For simplicity, we assume log(x) := log_2(x). A weighted set is a pair (P, w), where P ⊆ R² and w : P → [0, ∞) is a weights function. A partition {P_1, …, P_θ} of a set P ⊂ R² is called consecutive if for every i ∈ [θ−1] we have min{x | (x, y) ∈ P_{i+1}} > max{x | (x, y) ∈ P_i}. A query q ∈ (R^k)² is a pair of k-dimensional coefficient vectors. For any integer n ≥ 1, an n-signal is a set P of pairs {(1, y_1), …, (n, y_n)} in R². Such an n-signal corresponds to an ordered set of n reals, a discrete signal, or the graph (1, g(1)), …, (n, g(n)) of a function g : [n] → R. A set P ⊂ R² is an interval of an n-signal if P := {(a, y_a), (a+1, y_{a+1}), …, (b, y_b)} for some a, b ∈ [n] such that a < b. The sets {a, a+1, …, b} and {y_a, y_{a+1}, …, y_b} are the interval's first and second coordinates, respectively. Given a function f : R → R, the projection of P onto f is defined as {(a, f(a)), …, (b, f(b))}. The RFF problem is defined as follows; note that the following definition of a rational function is inspired by Proposition 1.

Definition 2 (RFF). We define ratio : (R^k)² × R → R to be the function that maps every pair q = (c, c′) ∈ (R^k)² and any x ∈ R to ratio(q, x) = ratio((c, c′), x) := poly(c, x) / (1 + x · poly(c′, x)) if 1 + x · poly(c′, x) ≠ 0, and ratio(q, x) := ∞ otherwise. For every pair (x, y) ∈ R², the loss of approximating (x, y) via a rational function q is defined as D(q, (x, y)) := |y − ratio(q, x)|. For a finite set P ⊂ R², we define the RFF loss of fitting q to P as ℓ(P, q) = Σ_{p∈P} D(q, p) = Σ_{(x,y)∈P} |y − ratio(q, x)|.
More generally, for a weighted set (P, w) we define the RFF loss of fitting q to (P, w) as ℓ((P, w), q) = Σ_{p∈P} w(p) · D(q, p) = Σ_{(x,y)∈P} w((x, y)) · |y − ratio(q, x)|. A coreset construction usually requires some rough approximation to the optimal solution as its input; see Section 1.4. Unfortunately, we do not know how to compute even a constant-factor approximation to the RFF problem in Definition 2 in near-linear time, but only in (2kn)^{O(k)} time; see Lemma 8. Instead, our work mostly focuses on efficiently computing an (α, β) or bicriteria approximation, as defined in Section 4.2 of Feldman & Langberg (2011).

Definition 3 ((α, β)-approximation). Let P be an interval of an n-signal. Let {P_1, …, P_β} be a consecutive partition of P, and q_1, …, q_β ∈ (R^k)². The set {(P_1, q_1), …, (P_β, q_β)} of β pairs is an (α, β)-approximation of P if Σ_{i=1}^β ℓ(P_i, q_i) ≤ α · min_{q∈(R^k)²} ℓ(P, q). If β = 1, the (α, 1)-approximation B = {(P_1, q_1)} is called an α-approximation. For every i ∈ [β] we denote by P′_i the projection of P_i onto q_i, and by ∪_{i=1}^β P′_i the projection of P onto B. We define the set of bicriterias of P as the union of all the (α′, β′)-approximations of P.

A coreset for the RFF problem is defined as follows. Similarly to Rosman et al. (2014), it includes an approximation, to allow a coreset construction despite the lower bound from Claim 1.

Definition 4 (ϵ-coreset). Let P ⊆ R² be an n-signal, and let ϵ > 0 be an error parameter. Let B := {(P_1, q_1), …, (P_β, q_β)} be a bicriteria approximation of P (see Definition 3), and let (C, w) be a weighted set. The tuple (B, C, w) is an ϵ-coreset for P if for every q ∈ (R^k)² we have |ℓ(P, q) − ℓ((C, w), q) − ℓ(P′, q)| ≤ ϵ · ℓ(P, q), where P′ := {(x, ratio(q_i, x)) | i ∈ [β], (x, y) ∈ P_i} is the projection of P onto B. The storage space required for representing such a coreset is in O(|C| + βk).
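Definition 2 and the loss above translate almost verbatim into code. The sketch below is our own transcription (names mirror the paper's notation; it is not the paper's implementation):

```python
# A direct transcription of Definition 2: poly, ratio, and the (optionally
# weighted) RFF loss. Coefficient vectors are plain tuples of length k.
import math

def poly(c, x):
    """poly(c, x) = sum_{i=1}^{k} c_i * x^(i-1)."""
    return sum(ci * x ** i for i, ci in enumerate(c))

def ratio(q, x):
    """ratio((c, c'), x) = poly(c, x) / (1 + x * poly(c', x)), or infinity
    when the denominator vanishes."""
    c, c_prime = q
    denom = 1.0 + x * poly(c_prime, x)
    return poly(c, x) / denom if denom != 0 else math.inf

def rff_loss(P, q, w=None):
    """RFF loss: sum of (optionally weighted) distances |y - ratio(q, x)|."""
    if w is None:
        w = {p: 1.0 for p in P}
    return sum(w[(x, y)] * abs(y - ratio(q, x)) for (x, y) in P)

# The constant function ratio(q, x) = 1, via c = (1, 0), c' = (0, 0):
q_const = ((1.0, 0.0), (0.0, 0.0))
assert rff_loss({(1, 1.0), (2, 3.0)}, q_const) == 2.0  # |1-1| + |3-1|
```

Note that, as in the definition, a query whose denominator vanishes at some sample point incurs infinite loss there, so such queries are never optimal for finite data.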

2.2. ALGORITHMS OVERVIEW

For less technical intuition for the algorithms, see Section J.4 in the appendix. For simplicity, in the following we consider k to be a constant. The input to Algorithm 1, which is the main algorithm, is an n-signal P and input parameters ϵ, δ ∈ (0, 1/10]. Its output, with probability at least 1 − δ, is an ϵ-coreset (B, C, w) of P whose size is in O(n^{o(1)}/ϵ²) for constant δ.

Algorithm 2: coreset construction. As in Definition 4, the coreset consists of an (α, β)-approximation B := {(P_1, q_1), …, (P_β, q_β)} of the input set P, and a weighted set (C, w). The weighted set (C, w) consists of a small sample C ⊆ P and its weights function w : C → (0, ∞). As in Feldman et al. (2012), for every i ∈ [β] the probability of placing a point p = (x, y) ∈ P_i into the sampled set C is proportional to D(q_i, p). Hence, a point (x, y) ∈ P whose y-value is far from (not approximated well by) B will be sampled with high probability, but a point that is close to B will probably not be sampled. The weight w(p) of a point is inversely proportional to its probability of being chosen, so that the sum of weighted distances to any query (rational function) equals the true sum of distances in expectation. The rest of the algorithms aim to compute the (α, β)-approximation B.

Algorithm 1: bicriteria tree construction. We compute B via a balanced β-tree, which is similar to the classic merge-reduce tree that is usually used to compute coresets for streaming data Braverman et al. (2020). However, the merge and reduce steps are different, as is the number of children of each node. Each leaf of this tree corresponds to a 1-approximation (i.e., an optimal solution) for a consecutive set of β := Θ(n^{1/log log(n)}) input points, which is computed via Algorithm 5. Hence, there are Θ(n/β) leaves. An inner node in the ith level corresponds to an (α_i, β_i)-approximation of its β child nodes; it has O(β^i) leaves and O(β^{i+1}) input points in its sub-tree, for every i ∈ [ℓ], where ℓ = ⌈log log(n)⌉ − 1 is the number of levels in the tree; this follows since (n^{1/log log(n)})^{log log(n)} = n. Here, α_i = 3^i and β_i = O(1)^i, and thus (α, β) = (α_ℓ, β_ℓ) ∈ (O(log(n)), log(n)^{O(1)}).

Algorithm 3: the merge-reduce step. This step is computed in each inner node of the tree. For an inner node in the ith level, where i ∈ {2, …, ℓ}, the input is a set B of size β, where each B_j ∈ B is a (0, r_j)-approximation of P_j (i.e., P_j is projected onto B_j), where P_1, …, P_β is an equally-sized consecutive partition of an interval of an n-signal P, and the output is a bicriteria approximation for P; see Algorithm 3 and Fig. 1. This is by computing the following for every possible subset G ⊆ [β] of size β − 6k + 3: (i) compute an α-approximation {(·, q_G)}, for some α ≥ 1, of ∪_{j∈G} P_j; (ii) for each j ∈ G, set ℓ_j := ℓ(P_j, q_G), the loss of q_G on P_j; (iii) set G′ ⊂ G to be the union of the 6k − 3 indices j ∈ G with the largest values ℓ_j; (iv) set s_G := Σ_{j∈G\G′} ℓ_j, the sum of the losses ℓ_j excluding the |G′| largest. The final bicriteria approximation for the inner node is the one that minimizes s_G among all the subsets G above. More precisely, we take its part that approximates the union of the |B| − 2|G′| children whose approximation error is s_G, and take its union with the 2|G′| original (input) bicriterias that correspond to the remaining child nodes. The final approximation error in the inner node is thus the minimum of s_G over every G ⊆ [β] of size β − 6k + 3.

Algorithm 4: restricted coresets for rational functions. Computing the α-approximation in Step (i) above can be done by computing the optimal solution, with details shown in Lemma 8. However, this would take n^{O(1)} time, and not quasi-linear time as desired.
To this end, Algorithm 4 constructs a coreset for rational functions that is restricted in the following two senses: (i) it assumes that the input signal is projected onto a bicriteria approximation, which is indeed the case in Step (i) of the merge-reduce step above; (ii) it approximates only specific rational functions, namely those that are 2^k-bounded over the first coordinates of the input signal; see Definition 5. We handle the second assumption by removing the O(k) sets where it does not hold (via the exhaustive search described above). It should be emphasized that the final coreset construction has no such restrictions or assumptions on either its input or its queries. This restricted coreset is computed on each child node, so that in Step (i) above we compute the α-approximation only on the union of the restricted coresets that correspond to the chosen child nodes. The size of this coreset, which fails with probability at most δ ∈ (0, 1/10], is m ∈ O((log(βn/δ))²), and thus the running time of our approximation algorithm (Algorithm 3) is m^{O(1)}, which is in O(n^{o(1)}) for every constant δ > 0. Algorithm 4 computes this restricted coreset by: (i) partitioning the input into chunks of exponentially increasing sizes (see Fig. 5); (ii) computing a sensitivity-based sub-sample of each chunk; (iii) returning the union of those samples after appropriate re-weighting. As we show in the proof, the sensitivity for the RFF problem inside each chunk can be reduced to the sensitivity of the polynomial fitting problem, which was previously tackled; see Corollary 2.
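The distance-proportional sampling and inverse-probability re-weighting described above for Algorithm 2 can be sketched as follows; the helper names are ours, and the unbiasedness check is analytic (no random draws needed):

```python
# Sketch of distance-proportional sampling with inverse-probability weights:
# each point is sampled with probability proportional to its distance d_i
# from the bicriteria approximation, and a sampled point is weighted by the
# inverse of its sampling probability, so the weighted loss is an unbiased
# estimate of the true loss.

def sampling_distribution(dists):
    """Pr[pick point i] = d_i / sum_j d_j."""
    total = sum(dists)
    return [d / total for d in dists]

def sample_weights(dists, m):
    """Weight of point i if sampled: 1 / (m * Pr[pick point i])."""
    total = sum(dists)
    return [total / (m * d) for d in dists]

# Analytic unbiasedness check for m independent draws:
# E[estimate] = m * sum_i Pr[i] * (w_i * d_i) = sum_i d_i.
dists = [5.0, 1.0, 0.5, 3.5]
m = 7
probs = sampling_distribution(dists)
ws = sample_weights(dists, m)
expected = m * sum(p * w * d for p, w, d in zip(probs, ws, dists))
assert abs(expected - sum(dists)) < 1e-9
```

Points with zero distance (already fitted exactly by the bicriteria approximation) need no sample at all, which is exactly why the projection P′ appears as a separate term in Definition 4.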

2.3. ALGORITHMS

From bicriteria to sampling-based coresets. In Feldman et al. (2012), a sampling-based coreset construction algorithm was suggested for the k-segments problem. This algorithm requires as input an (α, β)-approximation as defined above. With some modifications to this algorithm, we can efficiently construct a coreset for the RFF problem. The missing part is a bicriteria approximation for the RFF problem, which is the main focus and main contribution of this work.

Optimal solution (α = β = 1) in polynomial time. Using polynomial solvers and previous work (see e.g. Marom & Feldman (2019) and references therein), it can be proven that given a set P of n points in the plane, we can compute the optimal fitting rational function for P, i.e., the rational function that minimizes Equation 1 in Definition 2, in (2kn)^{O(k)} time; see Lemma 8.

Efficient (1, β)-approximation for large β. Using the polynomial-time optimal solution above, we can compute a (1, β)-approximation of an n-signal P for a large β. This is by partitioning the input into many (β) small subsets, and applying the optimal solution to each subset, which is relatively fast since each subset is very small; more precisely, in n · (2kβ)^{O(k)} time. See Algorithm 5 for a suggested implementation; note that the β there corresponds to n/β in this paragraph.

Algorithm 1: CORESET(P, k, ϵ, δ); see Theorem 1.
Input: An n-signal P, an integer k ≥ 1, and constants ϵ, δ ∈ (0, 1/10].
Output: A tuple (B, C, w), where B is a bicriteria approximation of P, and (C, w) is a weighted set.
1  β := n^{1/log(log(n))}; ψ := ⌈n/β⌉; Λ := 6k − 3
2  Set c* ≥ 1 to be a constant that can be determined from the proof of Theorem 1.
3  λ_1 := c* · (4^{k+1}k² + 1) · (k² log(4^{k+1}k² + 1) + log(kn/δ))
4  B := (B_1, …, B_|B|) := BATCH-APPROX(P, β) // see Algorithm 5.
5  while |B| > 2β do
6      Partition B into equally-sized consecutive subsets B′_1, …, B′_{|B|/ψ}.
7      B := the concatenation of REDUCE(B′_j, λ_1, Λ) over every such subset B′_j.
8  λ_2 := (c*/ϵ²) · log(n) · (k² log log(n) + log(n/δ))
9  (B, C, w) := SAMPLE-CORESET(B, λ_2) // see Algorithm 2.
10 return (B, C, w).

The following theorem proves the correctness of Algorithm 1; see Theorem 5 for a full proof.

Theorem 1. Let P be an n-signal, for n that is a power of 2, and put ϵ, δ ∈ (0, 1/10]. Let (B, C, w) be the output of a call to CORESET(P, k, ϵ, δ); see Algorithm 1. With probability at least 1 − δ, (B, C, w) is an ϵ-coreset of P; see Definition 4. Moreover, the computation time of (B, C, w) is in 2^{O(k²)} · n · n^{O(k)/log log(n)} · log(n)^{O(k log(k))} · log(1/δ), and the memory words required to store (B, C, w) are in (2k)^{O(1)} · log(n)^{O(1)+log(k)} · log(1/δ)/ϵ². In particular, if k and δ are constants, the running time is in O(n^{1+o(1)}) and the space is in O(n^{o(1)}/ϵ²).

Figure 1 (caption): In Line 4, an input n-signal P (black ticks and red dots) and its partition {P_1, …, P_16} into ψ = 16 sets via a call to Algorithm 5. This call also computes a (1, 1)-approximation B_i = {(P_i, Q_i)} for every P_i (green curves); see Definition 3. In Line 6, the set {B_1, …, B_16} is partitioned into B = B′_1 ∪ B′_2 ∪ B′_3 ∪ B′_4, where each such set contains 4 elements of B. (Middle) In Line 7, for every i ∈ [4] we set B_i = ((P_i^{(1)}, Q_i^{(1)}), …, (P_i^{(|B_i|)}, Q_i^{(|B_i|)})) as the output of a call to REDUCE(B′_i). For every i ∈ [3] the projection of every …
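The arithmetic behind the tree parameters can be sanity-checked numerically; the sketch below is our own, with n chosen as a power of a power of 2 so that every quantity is an exact integer:

```python
# Sanity check (our own arithmetic) of the tree parameters in Algorithm 1:
# with beta = n^(1 / log log n), stacking log log n levels recovers n, since
# (n^(1/log log n))^(log log n) = n. Here n = 2^16, so log log n = 4 exactly.
import math

n = 2 ** 16
loglog = math.log2(math.log2(n))   # log log n = 4
beta = n ** (1.0 / loglog)         # n^(1/4) = 16
assert round(loglog) == 4
assert round(beta) == 16
assert round(beta ** loglog) == n  # beta^(log log n) == n
# Number of leaves, each covering beta consecutive input points:
assert round(n / beta) == 4096
```

This log log n height is what produces the unusual n^{o(1)} factors in the running time and coreset size.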

3. EXPERIMENTAL RESULTS

We implemented our coreset construction from Algorithm 1 in Python 3.7, and in this section we evaluate its empirical results on both synthetic and real-world datasets. More results are placed in the supplementary material; see Section I. Open-source code can be found in Code (2022). Note that, since it is non-trivial to accelerate the computation of the optimal solution using the coreset, we focus on the quality of the compression, similarly to video compression analysis. Hardware. We used a PC with an Intel Core i7-10700, an NVIDIA GeForce GTX 1660 SUPER (GPU), and 16GB of RAM. The gap between theory and practice. In practice, we apply three minor changes to our theoretically-provable algorithms, which have a seemingly negligible effect on their output quality, but aim to improve their running times. Those changes are: (i) Line 11 of Algorithm 3 computes the query with the smallest cost ℓ((S, w), r) over every r ∈ R^k × R^k. Instead, to reduce the computation time, we iterate only over the queries in FAST-CENTROID-SET(S, 64); see Algorithm 6. (ii) Line 2 of Algorithm 5 computes, for every P_i ⊂ R² defined in Algorithm 5, the query with the smallest cost ℓ(P_i, r) over every r ∈ R^k × R^k. Instead, to reduce the computation time, for every such P_i we iterate only over the queries in FAST-CENTROID-SET(P_i, 64); see Algorithm 6. (iii) In Line 11 of Algorithm 4, the set S_i^j is sampled from an interval P_i^j of an n-signal, where each point (x, y) ∈ P_i^j is sampled with probability s′(x). We observed, in practice, that the probabilities assigned by s′ were sufficiently close to 1/|P_i^j| for most of the points. Hence, to reduce the computation time, we sampled the set S_i^j uniformly from P_i^j. Global parameters. We used the degree k = 2 in our experiments, since it seemed a sufficiently large degree to allow a visually pleasing approximation, as seen, for example, in Figure 13. Competing methods.
We consider the following compression schemes: (i) RFF-coreset(P, λ): the implementation based on Algorithm 1, where we set β = 32, ψ = 32, λ_1 = 32, and Λ = 0. (ii) FRFF-coreset(P, λ): a heuristic modification of RFF-coreset above, where the call to REDUCE at Line 7 of Algorithm 1 is replaced with a call to FAST-CENTROID-SET (see Algorithm 6); this should boost the running time while slightly compromising accuracy. (iii) Gradient(P, λ): a rational function q ∈ (R^k)² was fitted to the input signal via scipy.optimize.minimize with the default values, where the function minimized is ℓ(P, q), and the starting position is {0}^k × {0}^k; then a coreset was constructed using the provable algorithm SAMPLE-CORESET({(P, q)}, λ). (iv) L∞-Coreset(P, λ): a rational function q ∈ argmin_{q′∈(R^k)²} max_{p∈P} D(q′, p) was computed based on Peiris et al. (2021); then a coreset was constructed using the provable algorithm SAMPLE-CORESET({(P, q)}, λ). (v) RandomSample(P, λ): returns a uniform random sample of size 1.5·λ from P. (vi) NearConvexCoreset(P, λ): the coreset construction from Tukan et al. (2020), chosen to be of size 1.5·λ. Coreset size. In all the following experiments, for a fair comparison, the input parameters of all the competing methods above were tuned so as to obtain an output coreset of the same desired size. Note that there can be small noise in the sizes due to repeated samples of the same element. Repetitions. Each experiment was repeated 100 times, and all the results are averaged over the repetitions. Evaluation. The quality of a given compression scheme (from the competing methods above) is defined as ε(q) = 100 · |ℓ′ − ℓ| / ℓ, where ℓ′ is the loss in Equation 1 when plugging in the compression and some query q, and ℓ is the same loss when plugging in the full input signal and the same query q.
We tested a couple of different options for such a query q: (i) since it is hard to compute the optimal solution q* that minimizes Equation 1, we sampled a set Q of 1024 queries using Algorithm 6 and recovered the query q ∈ Q which has the smallest loss over the full data. This query aims to test the compression accuracy in a near-optimal scenario. We then present ε(q) for this q. (ii) For every individual compression scheme, we picked the query q ∈ Q which yields the largest value of ε(q), i.e., the least satisfied query.
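The evaluation metric is simple enough to state as code; the numbers below are hypothetical toy losses, not measurements from the experiments:

```python
# The evaluation metric from the experiments: the approximation error of a
# compression, in percent, for a fixed query q.

def epsilon_pct(loss_on_compression, loss_on_full):
    """eps(q) = 100 * |l' - l| / l, where l' is the loss on the compression
    and l is the loss on the full input signal, for the same query q."""
    return 100.0 * abs(loss_on_compression - loss_on_full) / loss_on_full

assert epsilon_pct(102.0, 100.0) == 2.0  # 2% relative error
assert epsilon_pct(100.0, 100.0) == 0.0  # exact compression
```

Note that the metric is symmetric in over- and under-estimation, matching the 1 ± ϵ guarantee of Definition 4.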

3.1. SYNTHETIC DATA EXPERIMENT

In this experiment, we aim to reconstruct the explicit representation of a given noisy homogeneous recurrence sequence, as explained in Section 1 (specifically, see Definition 1 and Proposition 1). Dataset. Following the motivation in Section 1, the data (n-signal) P we used in this test is simply the set of n values {F(x) | x = j/n − 1/2, j ∈ [n]} of the generating function of a noisy Fibonacci series, i.e., F(x) = Σ_{i=0}^{99} s_{i+1} · x^i, where s_i is the ith element of the Fibonacci series with added Gaussian noise of zero mean and a standard deviation of 0.25. Results. Fig. 2 presents the results, along with error bars that present the 25% and 75% percentiles.
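A possible reconstruction of this data generation is sketched below; the function name, seed, and the value of n are our own choices, not taken from the paper's code:

```python
# Sketch of the synthetic data generation of Section 3.1 (our reconstruction):
# sample the degree-99 truncation of the Fibonacci generating function at the
# n points x = j/n - 1/2, after adding Gaussian noise (std 0.25) to each
# Fibonacci coefficient.
import random

def noisy_fibonacci_signal(n, num_terms=100, noise_std=0.25, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility (our choice)
    fib = [1, 1]
    while len(fib) < num_terms:
        fib.append(fib[-1] + fib[-2])
    coeffs = [s + rng.gauss(0.0, noise_std) for s in fib]
    def F(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return [(j / n - 0.5, F(j / n - 0.5)) for j in range(1, n + 1)]

P = noisy_fibonacci_signal(n=256)
assert len(P) == 256
```

All sample points lie in (−1/2, 1/2], inside the radius of convergence of the Fibonacci generating function, so the truncated series is well behaved.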

3.2. REAL-WORLD DATA EXPERIMENT

In this experiment we consider an input signal which contains n readings of some quantity (e.g., temperature or pressure) over time. We aim to fit a rational function to such a signal, with the goal of predicting future readings and uncovering some behavioral pattern. Dataset. In this experiment we consider the Beijing Air Quality Dataset Chen (2019) from the public UCI Machine Learning Repository Asuncion & Newman (2007). The dataset contains n = 7344, 8760, 8760, 8784 readings for the years 2013 to 2016, respectively. Each reading contains "the temperature (degree Celsius, denoted by TEMP), pressure (hPa, denoted by PRES), and dew point temperature (degree Celsius, denoted by DEWP)". For each year and property individually, we construct an n-signal of the readings over time. Missing entries were replaced with the average value of the corresponding property. Fig. 9 presents the dataset readings along with our rational function fitting algorithm and Scipy's rational function fitting. The Italy Air Quality Dataset Vito (2016) is also tested; see Section I.2 in the appendix. Results. Fig. 3 presents the results for the year 2016, along with error bars that present the 25% and 75% percentiles. Graphs for the other years, along with results for the dataset of Vito (2016), are placed in Section I.
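The mean-imputation step for missing readings can be sketched as follows (a toy version of our preprocessing; None is a stand-in for a missing entry, and the real dataset has one such column per property):

```python
# Mean imputation for a single property column: replace each missing entry
# with the average of the entries that are present.

def fill_missing_with_mean(readings):
    present = [r for r in readings if r is not None]
    mean = sum(present) / len(present)
    return [mean if r is None else r for r in readings]

assert fill_missing_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
```

The resulting column, paired with its time indices 1, …, n, is exactly an n-signal in the sense of Section 2.1.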

3.3. DISCUSSION

Fig. 4 demonstrates that rational function fitting is more suitable than polynomial fitting for a relatively normal dataset. It also shows that computing either of those fittings on top of the full data or on our coreset produces similar results. Lastly, Figs. 2 and 3 demonstrate that our method and its variants achieve, almost consistently, better approximation quality in less time compared to most of the competing methods; the only faster methods were the uniform sample and the gradient method, which yielded significantly worse results. For a more in-depth discussion see Section J, which is in the appendix due to space limitations.

4. FUTURE WORK AND CONCLUSION

This paper provides a coreset construction that gets a time-series and returns a small coreset that approximates its sum of (fitting) distances to any rational function of constant degree, up to a factor of 1 ± ϵ. The size of the coreset is sub-linear in n and quadratic in 1/ϵ. Our main application is fitting an auto-regression model, whose generating functions are rational. While we focused on sum of errors (distances), we expect easy generalization to squared distances, robust M-estimators and any other loss function that satisfies the triangle inequality up to a constant factor, as in other coreset constructions. We believe that the newly suggested technique initiates a line of research that will enable sub-linear time algorithms with provable approximations for more sophisticated stochastic models, such as those mentioned at the start of the paper in Section 1.

ETHICS STATEMENT

To the best of our knowledge, there are no ethical concerns for our work, due to the following:
• The work is of a theoretical nature that aims to develop efficient coresets for rational functions (and approximation is a major part of the contribution).
• All the datasets on which we tested our methods were from the public UCI Machine Learning Repository Asuncion & Newman (2007), more precisely Vito (2016) and Chen (2019).

REPRODUCIBILITY STATEMENT

In all our tests we included error bars, the hardware used, global parameters, and the dataset (citing existing data or its generation method), and the paper contains full pseudo-code for all the algorithms. Where there were assumptions on the data, they were stated explicitly. As stated in Code (2022), the authors commit to publishing the code for all of the tests (or part of them) upon acceptance of this paper or upon reviewer request.

A ADDITIONAL MOTIVATION FOR RFF

In the following figure we demonstrate that in some cases rational functions can yield better approximations than polynomials; this is essentially a variation of the known Runge's phenomenon Epperson (1987). We do not place the points exactly on a rational function, as is commonly done to demonstrate Runge's phenomenon Epperson (1987), since we also want to demonstrate the superiority of our methods over existing solvers; if all the points lie exactly on a rational function, existing rational interpolation methods can solve this case as well, for example the Padé approximant as mentioned in Baker Jr (1964), or rational function fitting for maximum deviation as mentioned in Peiris et al. (2021). While rational functions can give a better fit than polynomials in some cases, the converse also holds: in some instances polynomials yield a significantly better fit. For an example where polynomial fitting yielded better results than our methods, see Figure 9 and Figure 12, with a discussion in Section J.3.1.

B ADDITIONAL ALGORITHMS

B.1 SAMPLE BASED ON A BICRITERIA APPROXIMATION ALGORITHM

Algorithm 2 gets as input a bicriteria approximation of the RFF problem for some given input n-signal P; see Definition 3. The algorithm utilizes this rough approximation in order to compute, in linear time, an ϵ-coreset as in Definition 4 via sensitivity sampling. The formal statement is given in Lemma 1. This algorithm is a modified version of the algorithm presented for the k-segments problem in Feldman et al. (2012). The following lemma states the desired properties of Algorithm 2; see Lemma 6 for its proof.

Lemma 1. Let B := {(P_1, q_1), ..., (P_β, q_β)} be an (α, β)-approximation of some n-signal P, for some α > 0; see Definition 3. Put ϵ, δ ∈ (0, 1/10], and let λ := (c*/ϵ²) · (α + 1) · (k² log(α + 1) + log(1/δ)), where c* ≥ 1 is a constant that can be determined from the proof. Let (B, C, w) be the output of a call to SAMPLE-CORESET(B, λ); see Algorithm 2. Then, Claims (i)-(ii) hold as follows: (i) (B, C, w) can be stored using O(λ + βk) memory. (ii) With probability at least 1 - δ, we have that (B, C, w) is an ϵ-coreset of P; see Definition 4.

Algorithm 2: SAMPLE-CORESET(B, λ); see Lemma 1.
Input: A bicriteria approximation B := {(P_1, q_1), ..., (P_β, q_β)} of some n-signal P (see Definition 3), and an integer λ ≥ 1 for the sample size.
Output: A tuple (B, C, w), where B is a bicriteria approximation of P, and (C, w) is a weighted set.
  c := Σ_{i=1}^β ℓ(P_i, q_i)
  if c ∈ {0, ∞} then
    Let w : R² → {0} such that for every p ∈ R² we have w(p) = 0.
    return (B, ∅, w).
  s(p) := D(q_i, p)/c for every i ∈ [β] and every p ∈ P_i.
  Pick a sample S ⊆ P of λ points from P, where S is a multi-set and each point p ∈ S is sampled i.i.d. with probability s(p); observe that there might be repetitions in S.
  Set r(p) as the number of repetitions of p in the multi-set S, for every p ∈ P.
  w(p) := r(p)/(λ · s(p)) for every p ∈ S.
  S′ := {(x, ratio(q_i, x)) | i ∈ [β], (x, y) ∈ P_i ∩ S} // project the labels of every set P_i ∩ S onto q_i and take their union.
  for every i ∈ {1, ..., β} do
    w((x, ratio(q_i, x))) := -w(p) for every p = (x, y) ∈ S ∩ P_i
  C := S ∪ S′
  return (B, C, w)
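The core sampling step of Algorithm 2 can be sketched as follows. This is a hedged illustration, not the paper's implementation: the array `dists` plays the role of the distances D(q_i, p) to the bicriteria, and the weighted sample then estimates the loss of a *different* query (here `new_dists`), in the spirit of the coreset guarantee.

```python
import numpy as np

def sample_coreset(dists, lam, rng):
    """Sensitivity sampling: pick lam i.i.d. indices with probability
    s(p) = D(q_i, p) / c, and reweight by w(p) = r(p) / (lam * s(p))."""
    c = dists.sum()
    s = dists / c
    idx = rng.choice(len(dists), size=lam, replace=True, p=s)
    picked, reps = np.unique(idx, return_counts=True)
    w = reps / (lam * s[picked])
    return picked, w

rng = np.random.default_rng(0)
dists = np.linspace(1.0, 2.0, 100)               # distances to the bicriteria queries
picked, w = sample_coreset(dists, lam=5000, rng=rng)

new_dists = np.sqrt(np.linspace(1.0, 2.0, 100))  # distances to some other query
estimate = (w * new_dists[picked]).sum()         # weighted-sample estimate of the loss
exact = new_dists.sum()
```

Since the weights are inverse sampling probabilities, the estimate is unbiased, and points with larger distance (higher sensitivity) are sampled more often.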

B.2 FROM LARGER TO SMALLER VALUES OF β

In this section, we show how, given an (α, β)-approximation with large β, we can recover an (α′, β′)-approximation with β′ < β but a larger α′ > α. This is achieved by computing an approximation to the projection, onto the (α, β)-approximation, of the set of points it approximates. This is implemented in Algorithm 3, which utilizes Algorithm 4. To efficiently compute the previously stated bicriteria we will utilize restricted coresets. It should be emphasized that the final coreset will not contain such a restriction. One of the limitations on the restricted coresets involves the following definition.

Definition 5 (ρ-bounded function). For every X ⊂ R, ρ ∈ [1, ∞), and any (c, c′) ∈ (R^k)², we say that (c, c′) is ρ-bounded over X if and only if max_{x∈X} f(x) / min_{x∈X} f(x) ≤ ρ, where the function f : R → R maps every x ∈ R to f(x) = 1/(1 + x · poly(c′, x)).
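Definition 5 can be checked numerically. The sketch below assumes the coefficient convention poly(c′, x) = c′[0] + c′[1]x + ... (not spelled out in the paper) and takes absolute values to guard against sign changes of the denominator, which is our own implementation choice:

```python
import numpy as np

def is_rho_bounded(c_prime, X, rho):
    """Check whether f(x) = 1 / (1 + x * poly(c', x)) varies by at most a
    factor of rho over the finite set X, as in Definition 5."""
    X = np.asarray(X, dtype=float)
    # np.polyval expects the highest-degree coefficient first, hence the reversal.
    f = np.abs(1.0 / (1.0 + X * np.polyval(c_prime[::-1], X)))
    return f.max() / f.min() <= rho

X = np.array([1.0, 2.0, 3.0])
print(is_rho_bounded(np.array([0.1, 0.0]), X, rho=2.0))  # True: f varies mildly here
```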

Overview of Algorithm 3

The input for the algorithm is P, an interval of an n-signal, which is projected onto some set of approximations. This projection is represented by the set B, where each element B_i ∈ B is a (0, r_i)-approximation of some P_i, and P_1, ..., P_|B| is a consecutive partition of P. The algorithm also receives a parameter λ, which controls the size of the output, and a parameter Λ, which controls the trade-off between the running time and robustness. The algorithm returns B′, a bicriteria approximation of P, where the size of B′ is smaller than Σ_{i=1}^{|B|} |B_i|. The algorithm runs in O(|P|^{1+o(1)}) time. This (α, β)-approximation is computed as mentioned in Section 2.2.

Algorithm 3: REDUCE(B, λ, Λ); see Lemma 2.
Input: A set B := {B_1, ..., B_β}, where each B_i ∈ B is a (0, r_i)-approximation of P_i, i.e., P_i is projected onto B_i, and {P_1, ..., P_β} is an equally-sized consecutive partition of P, some interval of an n-signal (see Figure 6 and Definition 3), and integers λ ≥ 1 and Λ ≥ 0.
Output: A bicriteria approximation B′ of P; see Definition 3.
  ℓ* := ∞; B′ := ∪_{i=1}^β B_i
  for every B_i ∈ B do
    Identify B_i^{(1)}, ..., B_i^{(r_i)} := B_i.
    for every B_i^{(j)} ∈ B_i do
      (S_i^{(j)}, w_i^{(j)}) := MINI-REDUCE(B_i^{(j)}, λ) // see Algorithm 4.
  for every set G ⊆ {1, ..., β} of size |G| = β - Λ do
    S_G := ∅.
    for every i ∈ G and B_i^{(j)} ∈ B_i do
      Set w_G(p) := w_i^{(j)}(p) for every p ∈ S_i^{(j)}.
      S_G := S_G ∪ S_i^{(j)}
    Set q_G ∈ arg min_{q ∈ (R^k)²} ℓ((S_G, w_G), q); see Definition 2 and Lemma 8.
    for every i ∈ G do
      ℓ_i := ℓ(P_i, q_G)
    Set G′ ⊆ G to be the union of the 6k - 3 indices i ∈ G with the largest value ℓ_i; ties broken arbitrarily.
    if Σ_{i∈G\G′} ℓ_i < ℓ* then
      ℓ* := Σ_{i∈G\G′} ℓ_i // update smallest loss
      Set {R_1, ..., R_γ} to be the smallest partition of G \ G′ such that for every i ∈ [γ] we have G′ ∩ [min(R_i), max(R_i)] = ∅, and for any i, j ∈ [γ], where i ≠ j, we have R_j ∩ [min(R_i), max(R_i)] = ∅.
      // Via a simple greedy partition; see Fig 5.
      P′_i := {(x, ratio(q_G, x)) | (x, y) ∈ P_i} for every i ∈ G // the projection of P_i onto q_G.
      P*_i := ∪_{ψ∈R_i} P′_ψ, for every i ∈ [γ] // union of all the sets P′_ψ with index ψ in R_i.
      B′ := {(P*_1, q_G), ..., (P*_γ, q_G)} ∪ {B_i^{(j)} ∈ B_i | i ∈ G′ ∪ ([β] \ G)}
  return B′.

Note that in the following lemma λ is different than in Lemma 3. This is since Algorithm 4 will be called as a subroutine at most n times, and as such we need to adjust the failure probability in this use of Lemma 3 to δ/n. The following lemma states the desired properties of Algorithm 3; see Lemma 14 for its proof.

Lemma 2. Let B := {B_1, ..., B_β} such that there is {P_1, ..., P_β}, an equally-sized consecutive partition of some interval of an n-signal P, |P| ≥ 2k, where each B_i ∈ B is a (0, r_i)-approximation of P_i, i.e., P_i is projected onto B_i; see Figure 6 and Definition 3. Put ϵ, δ ∈ (0, 1/10], and let λ := (c*/ϵ²) · (4^{k+1} k² + 1) · (k² log²(4^{k+1} k² + 1) + log²(nk/δ)) be an integer, where c* ≥ 1 is a constant that can be determined from the proof. Let B′ be the output of REDUCE(B, λ, 6k - 3); see Algorithm 3. With probability at least 1 - δ, we have that B′ is a (1 + 10ϵ, β*)-approximation of P for some β* ≥ 1; see Definition 3.

Figure 6: Illustration of the input to Algorithm 3. The set {P_1, P_2, P_3} is an equally-sized consecutive partition of P, an interval of an n-signal, where for every i ∈ {1, 2, 3} the set P_i is projected onto a bicriteria of size 2, denoted by B_i^{(1)}, B_i^{(2)}.

Overview of Algorithm 4. The input for the algorithm is the projection of an interval of an n-signal onto some rational function. This projection is represented by a (0, 1)-approximation; see Definition 3. The algorithm also receives an integer λ ≥ 1 which controls the size of the output.
The algorithm computes a (restricted) coreset for the projected input, for which the approximation guarantees hold only for the subset of rational functions that are 2^k-bounded over the first coordinate of the input interval; see Definition 5 and Lemma 3. Algorithm 4 computes this restricted coreset as mentioned in Section 2.2. We note that in our code we substitute the precise computation of the partition in Line 3 of Algorithm 4, presented in Lemma 11, by an "approximate partition", where the roots of the polynomials in Lemma 11 are approximated by numeric methods; this is done using the method roots from the library numpy, which, to the best of our knowledge, utilizes Horn & Johnson (1999).

Algorithm 4: MINI-REDUCE(B, λ); see Lemma 3.
Input: An interval of an n-signal P which is projected onto some q ∈ (R^k)², i.e., ℓ(P, q) = 0. This is represented by B := {(P, q)}, which is a (0, 1)-approximation of P (see Definition 3), and an integer λ ≥ 1.
Output: A weighted set (S, w), i.e., S ⊂ R² and w : S → R; see Section 2.1.
  X := {x | (x, y) ∈ P}, i.e., X is the union over the first coordinate of every pair in P.
  (c, c′) := q ∈ (R^k)².
  Let {X_1, ..., X_η} be a partition of X into η ∈ O(k) sets, such that for every i ∈ [η] the function f(x) = |1 + x · poly(c′, x)| is monotonic over [min(X_i), max(X_i)], and for every i, j ∈ [η], where i ≠ j, we have X_i ∩ [min(X_j), max(X_j)] = ∅; see Lemma 11.
  S := ∅
  for every i ∈ {1, ..., η} do
    Let X_i^1, ..., X_i^{m_i} be a consecutive partition of X_i into m_i ∈ Θ(log |X_i|) sets such that for every j ∈ [m_i] we have |X_i^j| = 2^{min{j-1, m_i-j}}. // See Figure 7 for an illustration.
    for every j ∈ [m_i] do
      Let s : X_i^j → (0, ∞) such that s(x) ≥ sup_c |poly(c, x)| / Σ_{x′∈X_i^j} |poly(c, x′)| for every x ∈ X_i^j, and Σ_{x∈X_i^j} s(x) ∈ O(k²), where the supremum is over c ∈ R^{2k+1} such that |poly(c, x)| > 0; see Corollary 2.
      Set s′(x) := s(x) / Σ_{x′∈X_i^j} s(x′) for every x ∈ X_i^j.
      P_i^j := {(x, ratio(q, x)) | x ∈ X_i^j} // see Definition 2.
      Pick a sample S_i^j of λ i.i.d. points from P_i^j, where each (x, y) ∈ P_i^j is sampled with probability s′(x).
      S := S ∪ S_i^j
  Set w(p) := 1/(λ · s′(x)) for every p = (x, y) ∈ S.
  return (S, w)

The following lemma states the desired properties of Algorithm 4; see Lemma 13 for its proof.

Lemma 3. Let P be an interval of an n-signal which is projected onto some q ∈ (R^k)², i.e., ℓ(P, q) = 0. Let B := {(P, q)}, which is a (0, 1)-approximation of P; see Definition 3. Let X be the first coordinate of P, i.e., X := {x | (x, y) ∈ P}. Put ϵ, δ ∈ (0, 1/10], and let λ ≥ (c*/ϵ²) · (4^{k+1} k² + 1) · (k² log(4^{k+1} k² + 1) + log(k log n / δ)) be an integer, where c* > 1 is a constant that can be determined from the proof. Let (S, w) be the weighted set that is returned by a call to MINI-REDUCE(B, λ); see Algorithm 4. Then |S| ∈ O(kλ · log n) and, with probability at least 1 - δ, for every q′ ∈ (R^k)² that is 2^k-bounded over X (see Definition 5), we have

|ℓ(P, q′) - ℓ((S, w), q′)| ≤ ϵ · ℓ(P, q′). (2)

B.3 COMPUTING AN (α, β)-APPROXIMATION WITH LARGE β

Overview of Algorithm 5. Algorithm 5 takes as input an n-signal P and an integer β ≥ 1. It aims to partition P into ψ ∈ Θ(β) sets P_1, P_2, ..., P_ψ of the same size, and to compute, for every such set P_i, the query q_i ∈ (R^k)² that minimizes the RFF loss ℓ(P_i, q) for this set. The algorithm outputs the sets P_i in the partition of P, each equipped with its optimal query q_i. As the time to compute the optimal query for each set in the partition of P depends polynomially on the size of the set, we need those sets to be small. Unfortunately, this implies that we must plug in a large value of β. To this end, this algorithm on its own does not suffice in order to compute the desired (α, β)-approximation with small values of both α and β. However, this algorithm is still utilized in Algorithm 1 as a sort of initialization.
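The doubling block sizes |X_i^j| = 2^{min{j-1, m_i-j}} used by Algorithm 4 grow geometrically from both ends of the interval toward its middle, so only O(log n) blocks are needed. A sketch is below; letting the middle block absorb any remainder is our implementation choice, not fixed by the analysis:

```python
def exp_partition_sizes(n):
    """Consecutive block sizes 1, 2, 4, ..., 4, 2, 1 covering n points, so the
    number of blocks is O(log n); the middle block absorbs the remainder."""
    left, right, size, remaining = [], [], 1, n
    while remaining >= 2 * size:
        left.append(size)
        right.append(size)
        remaining -= 2 * size
        size *= 2
    middle = [remaining] if remaining else []
    return left + middle + right[::-1]

print(exp_partition_sizes(10))  # [1, 2, 4, 2, 1]
```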
Algorithm 5: BATCH-APPROX(P, β); see Lemma 4.
Input: An n-signal P, where n ≥ 2k is a power of 2, and a positive integer β.
Output: An ordered set B = (B_1, ..., B_ψ) of ψ ∈ O(β) elements, where every B_i is a (1, 1)-approximation of a set P_i in some consecutive partition {P_1, ..., P_ψ} of P; see Definition 3.
  Compute an equally-sized partition {P_1, ..., P_ψ}, where ψ ∈ [⌊β/2⌋, β], of P whose set size is |P_1| = 2^m, for some integer m ≥ 1.
  For every i ∈ [ψ], let q_i be the optimal fitting rational function for P_i, i.e., q_i ∈ arg min_{q ∈ (R^k)²} ℓ(P_i, q); see Lemma 8 for an implementation.
  Set B := (B_1, ..., B_ψ), where B_i := {(P_i, q_i)}, for every i ∈ [ψ].
  return B.

Lemma 4. Let P be an n-signal, where n is a power of 2. Let β be a positive integer. Let B := (B_1, ..., B_ψ), where ψ ∈ [⌊β/2⌋, β], be the output of a call to BATCH-APPROX(P, β); see Algorithm 5. Put {(P_i, q_i)} := B_i, for every i ∈ [ψ]. Then, B′ := {(P_1, q_1), ..., (P_ψ, q_ψ)} is a (1, β)-approximation of P; see Definition 3. Moreover, the output of the call to BATCH-APPROX(P, β) can be computed in n · (2kn/β)^{O(k)} time.

Proof. By its construction in Algorithm 5 we have that B′ is a (1, β)-approximation of P. By Lemma 8, for every i ∈ [ψ], the computation time of every q_i in Line 2 of Algorithm 5 is in (2k|P_i|)^{O(k)}. Combining this with the construction of Algorithm 5 proves the lemma.
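The equally-sized power-of-two partition computed by Algorithm 5 can be sketched as follows, over indices rather than actual signal points (an illustrative simplification):

```python
def batch_partition(n, beta):
    """Split indices 0..n-1 into psi equal consecutive chunks of power-of-two
    size, with psi in [beta // 2, beta]; n is assumed to be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    size = 1
    while n // size > beta:   # double the chunk size until psi <= beta
        size *= 2
    return [list(range(i * size, (i + 1) * size)) for i in range(n // size)]

chunks = batch_partition(16, beta=5)
print(len(chunks), len(chunks[0]))  # 4 4
```

Doubling the chunk size until the count drops to at most β guarantees the count stays above ⌊β/2⌋, which gives the ψ ∈ [⌊β/2⌋, β] range of the lemma.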

B.4 FAST PRACTICAL HEURISTIC

Unfortunately, the running time of the algorithms is still large. Therefore, we suggest a heuristic to run on top of our coreset. We later prove that, under some assumptions, this heuristic gives a constant factor approximation. For this heuristic we need the following definition.

Definition 6. Let S be a set of 2k points on the plane. We define SOLVER(S) as an arbitrary (c, c′) ∈ (R^k)² that satisfies ℓ(S, q) = 0 if there is such a pair; otherwise it is empty. In Lemma 20, we prove that if |{x · y | (x, y) ∈ S}| = 2k, then SOLVER is never empty and can be computed in O(k³) time. In our companion code, we sample G directly from P.

Algorithm 6: FAST-CENTROID-SET(P, β); see Lemma 21.
Input: A finite set P ⊂ R² of at least 2k points, where ∀S ⊆ P, |S| = 2k : |{x · y | (x, y) ∈ S}| = 2k, and an integer β ≥ 1.
Output: A set G ⊂ (R^k)² of size |G| ≤ β.
  G := {S ⊆ P | |S| = 2k}.
  if |G| ≤ β then // |G| = (|P| choose 2k)
    return ∪_{S∈G} SOLVER(S) // see Definition 6.
  Pick a sample G′ ⊆ G of |G′| = β

C ALGORITHM 2: CORESET GIVEN AN (α, β)-APPROXIMATION

The coreset construction that we use in Algorithm 2 is a non-uniform sample from a distribution, known as sensitivity, that is based on the (α, β)-approximation defined in Definition 3. To apply the generic coreset construction we need two ingredients: (i) A bound on the dimension induced by the query space ("complexity") that corresponds to our problem, as formally stated and bounded in Subsection C.1. This bound on the dimension determines the required size of the random sample picked in Algorithm 2. (ii) A bound on the sensitivity, as formally stated and bounded in the proof of Lemma 6. This bound on the sensitivity also determines the required size of the random sample picked in Algorithm 2.
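SOLVER from Definition 6 admits a simple linear-algebra sketch: requiring ratio(q, x) = poly(c, x)/(1 + x·poly(c′, x)) to pass exactly through 2k points gives one equation per point that is linear in (c, c′). The coefficient convention (c[0] the constant term) and the use of least squares instead of an exact O(k³) solve are our assumptions, not the paper's procedure:

```python
import numpy as np

def solver(points, k):
    """Fit ratio(q, x) = poly(c, x) / (1 + x * poly(c', x)) exactly through
    2k points: each point (x, y) gives poly(c, x) - y * x * poly(c', x) = y,
    which is linear in the 2k unknowns (c, c')."""
    pts = np.asarray(points, dtype=float)
    assert len(pts) == 2 * k
    xs, ys = pts[:, 0], pts[:, 1]
    V = np.vander(xs, k, increasing=True)          # columns 1, x, ..., x^{k-1}
    A = np.hstack([V, -(ys * xs)[:, None] * V])    # blocks for c and c'
    q, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return q[:k], q[k:]

def ratio(c, c_prime, x):
    return np.polyval(c[::-1], x) / (1.0 + x * np.polyval(c_prime[::-1], x))

# Recover a known rational function from 2k = 4 of its samples.
c_true, cp_true = np.array([1.0, 2.0]), np.array([0.5, 0.0])
xs = np.array([1.0, 2.0, 3.0, 4.0])
pts = np.column_stack([xs, ratio(c_true, cp_true, xs)])
c, cp = solver(pts, k=2)  # c ≈ [1, 2], cp ≈ [0.5, 0]
```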

C.1 BOUND ON THE DIMENSION OF THE QUERY SPACE

We first define the classic notion of VC-dimension, which is used in Theorem 8.14 in Anthony & Bartlett (2009), and is usually related to PAC-learning theory Li et al. (2001).

Definition (VC-dimension Lucic et al. (2017)). Let F ⊆ {f : R^d → {0, 1}} and let X ⊆ R^d. Fix a set S = {x_1, ..., x_n} ⊆ X and a function f ∈ F. We call S_f = {x_i ∈ S | f(x_i) = 1} the induced subset of S by f. A subset S = {x_1, ..., x_n} of X is shattered by F if |{S_f | f ∈ F}| = 2^n. The VC-dimension of F is the size of the largest subset of X shattered by F.

Theorem 2. Let h be a function from R^m × R^d to {0, 1}, and let H = {h_θ : R^d → {0, 1} | θ ∈ R^m}. Suppose that h can be computed by an algorithm that takes as input the pair (θ, x) ∈ R^m × R^d and returns h_θ(x) after no more than t of the following operations:
• the arithmetic operations +, -, ×, and / on real numbers,
• jumps conditioned on >, ≥, <, ≤, =, and ≠ comparisons of real numbers, and
• output 0, 1.
Then the VC-dimension of H is O(m² + mt).

For the sample mentioned at the start of Section C we utilize the following generalization of the previous definition of VC-dimension. This is commonly referred to as VC-dimension, but to differentiate this definition from the previous one, and to be in line with the notation in Feldman et al. (2019), we abbreviate it to dimension. This is the dimension induced by the query space which is assigned in Theorem 3 to obtain the proof of Algorithm 2.

Definition 7 (dimension Feldman et al. (2019)). Let Q be a set, and let F be a set of functions from Q to [0, ∞). For every Q ∈ Q and r ≥ 0, let range(F, Q, r) = {f ∈ F | f(Q) ≥ r}. Let ranges(F) = {range(F, Q, r) | Q ∈ Q, r ≥ 0}. Finally, let R_{Q,F} = (F, ranges(F)) be the range space induced by Q and F. The dimension of R_{Q,F} is the size of the largest subset of F shattered by ranges(F).

In the following lemma, which is inspired by Theorem 12 in Lucic et al. (2017), we bound the dimension which is assigned in Theorem 3 to obtain the proof of Algorithm 2. Lemma 5.
Let B = {(P_1, q_1), ..., (P_β, q_β)} be an (α, β)-approximation of some n-signal P = {(1, y_1), (2, y_2), ..., (n, y_n)}; see Definition 3. Let f : P × (R^k)² → [0, ∞) be the function that maps every p = (x, y) ∈ P, where p ∈ P_i, and every q ∈ (R^k)² to f(p, q) = D(q, (x, ratio(q_i, x))). For every i ∈ [n], let f_i : (R^k)² → [0, ∞) denote the function that maps every q ∈ (R^k)² to f_i(q) = f((i, y_i), q). Let F = {f_1, ..., f_n}. The dimension of the range space R_{(R^k)²,F} that is induced by (R^k)² and F is in O(k²).

Proof. For every (q, r) = ((c, c′), r) ∈ (R^k)² × R, let h_{(c|c′|r)} : R → {0, 1} be the function that maps every x = i ∈ [n] to h_{(c|c′|r)}(x) = 1 if and only if f_i(q) ≤ r, and every x ∈ R \ [n] to h_{(c|c′|r)}(x) = 0. Let H = {h_θ | θ ∈ R^{2k+1}}.

C.2 SENSITIVITY OF FUNCTIONS

For the self-containment of the work we state previous work on the sensitivity of functions. Observe that the following is stated in a more general form than required in this section. This is since we will re-use the stated results in later parts for the restricted coreset, while in this section we bound the sensitivity with respect to the projection onto a bicriteria; see Section 4 and Definition 3.

Definition 8 (query space Feldman et al. (2019)). Let P ⊂ R² be a finite non-empty set. Let f : P × (R^k)² → [0, ∞) and let loss : R^{|P|} → [0, ∞) be a function. The tuple (P, (R^k)², f, loss) is called a query space. For every q ∈ (R^k)² we define the overall fitting error of P to q by f_loss(P, q) := loss((f(p, q))_{p∈P}) = loss(f(p_1, q), ..., f(p_{|P|}, q)).

To emphasize that the following coreset is a subset of the input set, in contrast to the ϵ-coreset as in Definition 4, we call it a subset-ϵ-coreset. In Section C.4 we prove that there is no such coreset for the RFF problem; see Definition 2.

Definition 9 (subset-ϵ-coreset Feldman et al. (2019)). Let (P, (R^k)², f, loss) be a query space as in Definition 8. For an approximation error ϵ > 0, the pair S′ = (S, u) is called a subset-ϵ-coreset for the query space (P, (R^k)², f, loss) if S ⊆ P, u : S → [0, ∞), and for every q ∈ (R^k)² we have (1 - ϵ)f_loss(P, q) ≤ f_loss(S′, q) ≤ (1 + ϵ)f_loss(P, q).

Definition 10 (sensitivity of functions). Let P ⊂ R² be a finite and non-empty set, and let F ⊆ {P → [0, ∞]} be a possibly infinite set of functions. The sensitivity of every point p ∈ P is S*_{(P,F)}(p) = sup_{f∈F} f(p) / Σ_{p′∈P} f(p′), where the supremum is over every f ∈ F such that the denominator is positive. The total sensitivity is defined to be the sum of these sensitivities, S*_F(P) = Σ_{p∈P} S*_{(P,F)}(p). The function S_{(P,F)} : P → [0, ∞) is a sensitivity bound for S*_{(P,F)} if for every p ∈ P we have S_{(P,F)}(p) ≥ S*_{(P,F)}(p).
The total sensitivity bound is then defined to be S_{(P,F)}(P) = Σ_{p∈P} S_{(P,F)}(p). The following theorem proves that a coreset can be computed by sampling according to the sensitivity of functions. The size of the coreset depends on the total sensitivity and the complexity (VC-dimension) of the query space, as well as the desired error ϵ and probability δ of failure.

Theorem 3 (coreset construction Feldman et al. (2019)). Let
• P = {p_1, ..., p_n} ⊂ R² be a finite and non-empty set, and f : P × (R^k)² → [0, ∞).
• F = {f_1, ..., f_n}, where f_i(q) = f(p_i, q) for every i ∈ [n] and q ∈ (R^k)².
• d′ be the dimension of the range space that is induced by (R^k)² and F.
• s* : P → [0, ∞) be such that s*(p) is the sensitivity of every p ∈ P, after substituting P = P and F = {f′ : P → [0, ∞] | ∀p ∈ P, q ∈ (R^k)² : f′(p) := f(p, q)} in Definition 10, and s : P → [0, ∞) be a sensitivity bound for s*.
• t = Σ_{p∈P} s(p).
• ϵ, δ ∈ (0, 1).
• c > 0 be a universal constant that can be determined from the proof.
• λ ≥ c(t + 1)(d′ log(t + 1) + log(1/δ))/ϵ².
• w : P → {1}, i.e., a function such that for every p ∈ P we have w(p) = 1.
• (S, u) be the output of a call to CORESET-FRAMEWORK(P, w, s, λ) (Algorithm 1 in Feldman et al. (2019)).
Then, with probability at least 1 - δ, (S, u) is a subset-ϵ-coreset of size |S| ≤ λ for the query space (P, (R^k)², f, ∥·∥₁); see Definition 9.
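The sample size of Theorem 3 can be computed directly. The constant c is unspecified in the theorem, so c = 1 below is a placeholder:

```python
import math

def coreset_sample_size(t, d_prime, eps, delta, c=1.0):
    """lambda >= c * (t + 1) * (d' * log(t + 1) + log(1 / delta)) / eps^2."""
    return math.ceil(c * (t + 1) * (d_prime * math.log(t + 1)
                                    + math.log(1.0 / delta)) / eps ** 2)

# Halving eps roughly quadruples the required sample size.
print(coreset_sample_size(t=4.0, d_prime=9, eps=0.1, delta=0.01))
```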

C.3 ANALYSIS OF ALGORITHM 2: SAMPLE-CORESET

In the following lemma we prove Lemma 1, which states that, given values that satisfy the specified properties, Algorithm 2 yields an ϵ-coreset; see Definition 4.

Lemma 6. Let B := {(P_1, q_1), ..., (P_β, q_β)} be an (α, β)-approximation of some n-signal P, for some α > 0; see Definition 3. Put ϵ, δ ∈ (0, 1/10], and let λ := (c*/ϵ²) · (α + 1) · (k² log(α + 1) + log(1/δ)), where c* ≥ 1 is a constant that can be determined from the proof. Let (B, C, w) be the output of a call to SAMPLE-CORESET(B, λ); see Algorithm 2. Then, Claims (i)-(ii) hold as follows: (i) (B, C, w) can be stored using O(λ + βk) memory. (ii) With probability at least 1 - δ, we have that (B, C, w) is an ϵ-coreset of P; see Definition 4.

Proof. We have (i) by the construction of Algorithm 2 and the definitions in the theorem. Let c be as computed in the call to SAMPLE-CORESET(B, λ); see Algorithm 2. Since B is an (α, β)-approximation of P we have that c ≠ ∞. If c = 0, then the theorem holds by the construction of Algorithm 2. Hence, we assume this is not the case. Let P′ be the projection of P onto B, i.e., P′ := {(x, ratio(q_i, x)) | i ∈ [β], (x, y) ∈ P_i}; see Definition 3. Let Q be the union of all q = (c, c′) ∈ (R^k)² such that 1 + x · poly(c′, x) ≠ 0 for every (x, y) ∈ P. Let q = (c, c′) ∈ Q. We have

Σ_{p∈P} D(q, p) - Σ_{p∈C} w(p) · D(q, p) - Σ_{p∈P′} D(q, p)
= Σ_{p∈P} D(q, p) · ( (Σ_{p∈P} D(q, p) - Σ_{p∈P′} D(q, p)) / Σ_{p∈P} D(q, p) - Σ_{p∈C} w(p) · D(q, p) / Σ_{p∈P} D(q, p) ),   (4)

where the equality is by taking Σ_{p∈P} D(q, p) out of the sum. Let s : P → [0, ∞) be as defined in Line 8 of Algorithm 2 in the call to SAMPLE-CORESET(B, λ), i.e., for every i ∈ [β] and any p ∈ P_i we have

s(p) = D(q_i, p) / Σ_{i=1}^β ℓ(P_i, q_i).   (5)

Let i ∈ [β] and let s*_i : P_i → [0, ∞) be such that for every point p = (x, y) ∈ P_i we have

s*_i(p) = |D(q, p) - |f_{q_i}(x) - f_q(x)|| / Σ_{p∈P} D(q, p),   (6)

which, due to the definition of P′ as the projection of P onto B, is an upper bound on the contribution of every point in P_i to the sum in Equation 4.
Let p = (x, y) ∈ P_i, so that

s*_i(p) = |D(q, p) - |f_{q_i}(x) - f_q(x)|| / Σ_{p∈P} D(q, p)
= ||y - f_q(x)| - |f_{q_i}(x) - f_q(x)|| / Σ_{p∈P} D(q, p)
≤ |y - f_{q_i}(x)| / Σ_{p∈P} D(q, p)
= D(q_i, p) / Σ_{p∈P} D(q, p)
≤ α · s(p),   (7)

where the first equality is by the definition of s*_i from Equation 6, the second equality is by the definition of D, the inequality is by the reverse triangle inequality, the third equality is by the definition of D, and the last inequality is by Equation 5 and the definition of B as an (α, β)-approximation of P. Let s̄ : P → [0, ∞) be such that for every p ∈ P we have s̄(p) = α · s(p). By Equation 7, for every i ∈ [β], we have that s̄ is a sensitivity bound for s*_i. For every p = (x, y) ∈ P_i, i ∈ [β], and any q ∈ (R^k)², let f(p, q) = D(q, (x, ratio(q_i, x))). Let F = {f_1, ..., f_n}, where f_i(q) = f(p, q) for every p ∈ P_i, i ∈ [β], and q ∈ (R^k)². Let k* ∈ O(k²) be the dimension of the range space R_{P,F} from Lemma 5 when assigning P and B. Substituting ϵ := ϵ, δ := δ, λ := λ, the query space (P, Q, F, ∥·∥₁), d′ ∈ O(k²) the dimension induced by (R^k)² and F from Lemma 5, the sensitivity bound s̄, and the total sensitivity t = α Σ_{p∈P} s(p) = α in Theorem 3, combined with the construction of Algorithm 2, yields that, with probability at least 1 - δ, for every q ∈ Q, we have

| (Σ_{p∈P} D(q, p) - Σ_{p∈P′} D(q, p)) / Σ_{p∈P} D(q, p) - Σ_{p∈C} w(p) · D(q, p) / Σ_{p∈P} D(q, p) | ≤ ϵ.   (8)

Combining Equation 8 and Equation 4 proves the theorem.

C.4 LOWER BOUND

In this section we prove that there is no subset-ϵ-coreset for the query space (P, (R^k)², D, ∥·∥₁), as defined in Definition 8, where P ⊂ R² is an n-signal and D is as in Definition 2. This justifies our Definition 4 of a coreset for RFF. The main idea in the following claim is illustrated by Figure 8.

Claim 1 (Minor modification of Claim 5 in Rosman et al. (2014)). For every integer n ≥ 2 there is an n-signal P such that the following holds. For every C ⊆ R², where |C| < n, there is q ∈ (R^k)² such that Σ_{p∈C} D(q, p) ∈ [0, ∞) and Σ_{p∈P} D(q, p) = ∞.

Proof.
Let P = {(1, 0), ..., (n, 0)} and C ⊆ R², where |C| < n. Put (a, 0) ∈ P, a > 0, such that C ∩ {(a, y) | y ∈ R} = ∅; i.e., there is no point in C with x-value equal to a. There is such a point since |C| < n = |P|. Let c = (1, 0, 0, ..., 0), c′ = (-1/a, 0, 0, ..., 0), and q = (c, c′) ∈ (R^k)². Since D(q, (a, 0)) = |ratio(q, a) - 0| = ∞ (observe that 1 + a · poly(c′, a) = 1 - a/a = 0), and (a, 0) ∈ P, we obtain Σ_{p∈P} D(q, p) ≥ D(q, (a, 0)) = ∞. On the other hand, 1 - x/a = 0 only for x = a; therefore ∀p ∈ C : D(q, p) ∈ [0, ∞), and since C is a finite set we obtain Σ_{p∈C} D(q, p) ∈ [0, ∞).

Figure 8: The n-signal P = {(1, 0), ..., (25, 0)} as in Claim 1 for n = 25 (black dots). Consider a subset C that contains all these points except a single one, say, the red point (15, 0). We can always find a rational function (query) whose sum of distances is close to zero for C, but ∞ for P, due to its pole at the point (15, 0) that was not selected for C.
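Claim 1's construction is easy to verify numerically: the query 1/(1 - x/a) has a pole exactly at the one x-value missing from C, so its loss is infinite on P but finite on C (a small demonstration, assuming k is large enough to express this query):

```python
import numpy as np

def ratio_pole(a, x):
    """The query of Claim 1: poly(c, x) = 1 and poly(c', x) = -1/a, so
    ratio(q, x) = 1 / (1 - x / a), with a pole exactly at x = a."""
    with np.errstate(divide="ignore"):
        return 1.0 / (1.0 - x / a)

n, a = 25, 15.0
P_x = np.arange(1, n + 1, dtype=float)      # the n-signal {(1,0), ..., (n,0)}
C_x = P_x[P_x != a]                         # the subset misses x = a
loss_P = np.abs(ratio_pole(a, P_x)).sum()   # sum of distances to the labels y = 0
loss_C = np.abs(ratio_pole(a, C_x)).sum()
print(np.isinf(loss_P), np.isfinite(loss_C))  # True True
```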

D INEFFICIENT SOLVER FOR THE RFF PROBLEM

In the following section we prove that we can solve the RFF problem from Definition 2 in (2kn)^{O(k)} time, as previously mentioned in Section 2.1. The main idea is that, given an assignment of whether each point is below or above the best fitting rational function (with respect to the loss in Equation 1 of Definition 2), the problem can be written as a fractional polynomial programming problem and solved in time polynomial in the input size; see Pizzo et al. (2018). Using previous work Marom & Feldman (2019) we can bound the number of candidate assignments mentioned above to be polynomial in the input size. Hence, by constructing a semi-tree we can find all the satisfiable assignments in time polynomial in the input size, which enables us to compute an optimal solution in polynomial time. In the following section, for every x ∈ R let sign(x) = 1 if x > 0 and sign(x) = -1 otherwise.

Theorem 4 (Theorem 23 in Marom & Feldman (2019)). Let f_1, ..., f_m be real polynomials in k < m variables, each of degree at most b ≥ 1. Then the number of sign sequences (sign(f_1(x)), ..., sign(f_m(x))) over x ∈ R^k that consist of the terms 1, -1 is at most (4ebm/k)^k.

To utilize this previous bound we state the following observation.

Observation 1. Let P = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R². Let Q be the union of all (c, c′) ∈ (R^k)² such that 1 + x · poly(c′, x) ≠ 0 for every (x, y) ∈ P. There are polynomials g_1, ..., g_n : R^{2k} → R of degree in O(n) with 2k variables such that for every (c, c′) ∈ Q and every i ∈ [n] we have g_i(c | c′) = (1 + x_i · poly(c′, x_i)) · (poly(c, x_i) - y_i - y_i · x_i · poly(c′, x_i)).

Proof. Let q := (c, c′) ∈ Q. For every (x, y) ∈ P, by reorganizing the expression we have

sign(ratio((c, c′), x) - y) = sign((1 + x · poly(c′, x)) · (poly(c, x) - y - y · x · poly(c′, x))).
(9)
For every i ∈ [n], let g_i : R^{2k} → R be a polynomial of degree in O(n) with 2k variables that maps every (c, c′) ∈ (R^k)² to g_i(c | c′) = (1 + x_i · poly(c′, x_i)) · (poly(c, x_i) - y_i - y_i · x_i · poly(c′, x_i)). By Equation 9 we have that g_1, ..., g_n : R^{2k} → R satisfy the observation.

Using this observation and Theorem 4 we obtain the following result.

Corollary 1. Let P = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R². Let Q be the union of all (c, c′) ∈ (R^k)² such that 1 + x · poly(c′, x) ≠ 0 for every (x, y) ∈ P. The number of sign sequences (sign(ratio(q, x_1) - y_1), ..., sign(ratio(q, x_n) - y_n)) over every q ∈ Q is in (2kn)^{O(k)}.

Proof. By Observation 1, let g_1, ..., g_n : R^{2k} → R be real polynomials of degree in O(n) with 2k variables such that for every q := (c, c′) ∈ Q and every i ∈ [n] we have sign(ratio(q, x_i) - y_i) = sign(g_i(c | c′)).   (10)
By Theorem 4, the number of sign sequences (sign(g_1(x)), ..., sign(g_n(x))) over x ∈ R^{2k} that consist of the terms 1, -1 is in (2kn)^{O(k)}. Combining this with Equation 10 proves the corollary.

While, as stated before, the number of possible positions of the points is indeed polynomial, there remains the problem of computing this set in polynomial time. We solve this by utilizing previous work on polynomial programming mentioned in Pizzo et al. (2018).

Observation 2. Let P = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R². Let Q ⊂ (R^k)² be the union of every (c, c′) ∈ (R^k)² such that 1 + x · poly(c′, x) ≠ 0 for every (x, y) ∈ P. For every vector S ∈ {0, 1}^n we can check in (2kn)^{O(k)} time the existence of q ∈ Q that satisfies S = (sign(ratio(q, x_1) - y_1), ..., sign(ratio(q, x_n) - y_n)), i.e., satisfies the assignment defined by S.

Proof. By Observation 1, let g_1, ..., g_n : R^{2k} → R be real polynomials of degree in O(n) with 2k variables such that for every q := (c, c′) ∈ Q and every i ∈ [n] we have sign(ratio(q, x_i) - y_i) = sign(g_i(c | c′)).
Hence, for every assignment S ∈ {0, 1}^n, there is q ∈ Q such that S = (sign(ratio(q, x_1) - y_1), ..., sign(ratio(q, x_n) - y_n)) if and only if there is x ∈ R^{2k} such that S = (sign(g_1(x)), ..., sign(g_n(x))), and the latter can be written as a polynomial program and thus solved numerically in (2kn)^{O(k)} time, since this equality can be written as a sum of squares (SOS); see Pizzo et al. (2018).

Using this, in the following lemma, we prove that we can generate all the satisfiable options for the function to be above or below the points in polynomial, rather than exponential, time.

Lemma 7. Let P = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R². For every non-empty C ⊆ P let Q(C) ⊂ (R^k)² be the union of every (c, c′) ∈ (R^k)² such that 1 + x · poly(c′, x) ≠ 0 for every (x, y) ∈ C. All the sign sequences (sign(ratio(q, x_1) - y_1), ..., sign(ratio(q, x_n) - y_n)) over every q ∈ Q(P) can be computed in (2kn)^{O(k)} time.

Proof. By Corollary 1, for every C ⊆ P, |C| = m ≥ 2k, there are at most m^{O(k)} options for the sign sequence (sign(ratio(q, x) - y) | (x, y) ∈ C) over every q ∈ Q(C). Let C_1, C_2 ⊆ P, C_1 ∩ C_2 = ∅, s.t. |C_1|, |C_2| ≥ 1, where we know all the satisfiable sign sequences (sign(ratio(q, x) - y) | (x, y) ∈ C) over q ∈ Q(C) for C ∈ {C_1, C_2}. Since |C_1|, |C_2| ≤ n, by Corollary 1, the size of {(sign(ratio(q, x) - y) | (x, y) ∈ C) | q ∈ Q(C)} is in (2kn)^{O(k)} for C ∈ {C_1, C_2}. Hence, the size of the candidate set for {(sign(ratio(q, x) - y) | (x, y) ∈ C_1 ∪ C_2) | q ∈ Q(C_1 ∪ C_2)} is in (2kn)^{O(k)}. Therefore, by utilizing Observation 2 to validate each candidate, we can compute in (2kn)^{O(k)} time the set {(sign(ratio(q, x) - y) | (x, y) ∈ C_1 ∪ C_2) | q ∈ Q(C_1 ∪ C_2)}. Thus, partitioning P into sets of size O(k) ≥ 2k, computing for each such C ⊆ P, using Observation 2, in (2k)^{O(k)} time the set {(sign(ratio(q, x) - y) | (x, y) ∈ C) | q ∈ Q(C)}, and combining all the options as stated above, proves the lemma.
Using this observation, and returning to the original problem, we obtain the following solver. Lemma 8. For every weighted set (S, w) that contains n = |S| points, a pair q in arg min q∈(R k ) 2 ℓ (S, w), q can be computed in (2kn) O(k) time; see Definition 2. Proof. Let Q ⊂ R k 2 be the set of all pairs (c, c ′ ) ∈ R k 2 such that 1 + x • poly(c ′ , x) ̸ = 0 for every (x, y) ∈ S. By Lemma 7, in (2kn) O(k) time, compute all the n O(k) possible sign sequences sign(ratio(q ′ , x 1 ) - y 1 ), • • • , sign(ratio(q ′ , x n ) - y n ) over every q ′ ∈ R k 2 . For every such sign sequence w ′ , where w ′ (p) ∈ {-1, 1} is the sign of every point p ∈ S, observe the following: for every q := (c, c ′ ) ∈ Q satisfying the assignment by w ′ we have ℓ (S, w), q = p=(x,y)∈S w(p)w ′ (p) • poly(c, x) / (1 + x • poly(c ′ , x)) - y . Hence, by taking the common denominator on the right-hand side of the equation above, there are polynomials f, g : R k 2 → R of degree in (2kn) O(1) with R k 2 as the variables and (2kn) O(1) parameters, such that for every q ∈ Q we have ℓ (S, w), q = f (q) / g(q) . Hence, utilizing Pizzo et al. (2018), this problem can be solved in (2kn) O(k) time; the solution is a numerical solution that can be approximated to arbitrary precision. Hence, by taking the minimum over the (2kn) O(k) candidate solutions, we can compute q as defined in the lemma in (2kn) O(k) time.

E CORESET UNDER CONSTRAINTS; ANALYSIS OF ALGORITHM 4

In this section we will prove that Algorithm 4 constructs a restricted coreset for rational functions, which, as previously mentioned in Section 2.2, will be utilized to efficiently compute an (α, β)-approximation to a given n-signal P , where α, β ∈ O(log(n)). It should be emphasized that the final coreset construction has no such restrictions or assumptions for either its input or queries. For readability we split the proof into three parts: (i) Mostly citing previous work, we bound the polynomial-fitting sensitivity. (ii) Utilizing the previous bound, we compute a sensitivity for a restricted case of the RFF fitting problem from Equation 1 of Definition 2, which is formally stated and bounded in Lemma 10. (iii) We utilize the previous bound in Lemma 13, which proves the previously stated Lemma 3 that summarizes the desired properties of Algorithm 4.
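Throughout this appendix, poly(c, x) denotes the polynomial c 1 + c 2 x + • • • + c k x k-1 , ratio(q, x) = poly(c, x)/(1 + x • poly(c ′ , x)) for q = (c, c ′ ), and ℓ is the weighted sum of fitting distances from Definition 2. A minimal Python sketch of these quantities (the function names are ours, not taken from the paper's code); poly uses Horner's scheme, so each evaluation takes O(k) arithmetic operations:

```python
def poly(c, x):
    """Horner evaluation of poly(c, x) = c[0] + c[1]*x + ... + c[k-1]*x**(k-1),
    using O(k) arithmetic operations."""
    acc = 0.0
    for coef in reversed(c):
        acc = acc * x + coef
    return acc

def ratio(q, x):
    """ratio(q, x) = poly(c, x) / (1 + x * poly(c2, x)) for q = (c, c2)."""
    c, c2 = q
    return poly(c, x) / (1.0 + x * poly(c2, x))

def loss(points, weights, q):
    """Weighted RFF loss: sum over p = (x, y) of w(p) * |ratio(q, x) - y|."""
    return sum(w * abs(ratio(q, x) - y) for (x, y), w in zip(points, weights))
```

Note that ratio is undefined whenever 1 + x • poly(c ′ , x) = 0, which is exactly the restriction on the query set Q used below.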

E.1 UPPER BOUND ON THE POLYNOMIAL-FITTING SENSITIVITY

For the polynomial-fitting sensitivity, consider the following lemma that follows from the work on sensitivity of near-convex functions by Tukan et al. (2020). Lemma 9 (Lemma 35 in Tukan et al. (2020)). Let Y be a set of n points in R d . A function s ′ : Y → [0, ∞) can be computed in O n • d 2 time such that for every x ∈ Y we have sup q∈R d |x T • q| / y∈Y |y T • q| ≤ s ′ (x) , where the supremum is over q ∈ R d such that |x T • q| > 0, and x∈Y s ′ (x) ∈ O d 3/2 . Using this lemma we obtain the following corollary. Corollary 2. Let X be a set of n ≥ 1 reals, and k ≥ 1 be an integer. A function s : X → [0, ∞) can be computed in O n • k 2 time such that for every x ∈ X we have sup c∈R k |poly(c, x)| / y∈X |poly(c, y)| ≤ s(x), where the supremum is over c ∈ R k such that |poly(c, x)| > 0, and x∈X s(x) ∈ O k 3/2 . Proof. Let f : X → R k be the function that maps every x ∈ X to (1, x, • • • , x k-1 ) T . Let Y := {f (x) | x ∈ X} denote the image of f , and let s ′ : Y → [0, ∞) be as defined in Lemma 9. For every x ∈ X and c ∈ R k , where |poly(c, x)| > 0, we have |poly(c, x)| / y∈X |poly(c, y)| = |f (x) T • c| / y∈X |f (y) T • c| ≤ s ′ f (x) , where the equality is by the definition of poly, and the inequality is by the definition of s ′ ; see Lemma 9. Let s : X → [0, ∞) be the function that maps every y ∈ X to s(y) := s ′ f (y) . By Lemma 9, s satisfies all the claims in the corollary. Computation time of s. For every x ∈ X we can compute f (x) in O k 2 time. Hence, the computation time of Y is in O nk 2 . By Lemma 9, the computation time of s ′ is in O n • k 2 . Therefore, since for every y ∈ X we defined s(y) = s ′ f (y) , we have that the computation time of s is in O n • k 2 .
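The reduction in the proof of Corollary 2 is simply the moment embedding x ↦ (1, x, • • • , x k-1 ); a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def embed(X, k):
    """Map each real x to f(x) = (1, x, ..., x**(k-1)), one row per point,
    so that poly(c, x) = f(x) @ c: a sensitivity bound for the linear form
    |f(x)^T c| (Lemma 9) is therefore also a bound for |poly(c, x)|."""
    return np.vander(np.asarray(X, dtype=float), k, increasing=True)
```

For example, embed(X, k) @ c evaluates poly(c, •) at all points of X at once.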

E.2 UPPER BOUND ON THE RFF SENSITIVITY

In this section we bound the RFF sensitivity for the restricted case mentioned in property (ii) at the beginning of Section E. For this we define the following. Definition 11 (Lipschitz-function). Let r > 0, and a, b ∈ R, where b > a. A function f : [a, b] → [0, ∞) is r-Lipschitz if f is non-decreasing over [a, b], and for every c ≥ 1 and any x ∈ [a, b] we have f (c • x) ≤ c r • f (x). Using a known property of polynomial functions (Claim 2), we obtain the following corollary. Corollary 3. Let c ′ ∈ R k . Let G ⊆ R denote the extrema of the function g that maps every x ∈ R to g(x) = |1 + x • poly(c ′ , x)|. Let X = [a, b] ⊂ R, |X| > 1 such that for every x ∈ X we have min γ∈G |x -γ| ≥ max(X) -min(X). Then the function f : X → R that maps every x ∈ X to f (x) = 1 / (1 + x • poly(c ′ , x)) is well defined, and satisfies max x∈X |f (x)| / min x∈X |f (x)| ≤ 2 k . Proof. By the definition of X, if there is x ∈ X such that x ∈ G, then by substituting 0 = min γ∈G |x -γ| ≥ max(X) -min(X) we have that max(X) = min(X) and as such |X| = 1, which contradicts the definition of X in the claim. Therefore, we have X ∩ G = ∅, that is, g has no extrema over X. If there is x ∈ X such that g(x) = 0, then by the definition of g we have that x is an extremum of the function g. Hence, there is no x ∈ X such that g(x) = 0, which yields that the function f defined in the claim is well defined. Since g has no extrema over X, g is monotonic over X, and we consider both cases. Case (ii): g increases in the range of X. Let x, y ∈ X such that x < y. By the assumptions (and Claim 2) |1 + y • poly(c ′ , y)| ≤ 2 k • |1 + x • poly(c ′ , x)|, so f (x) ≤ 2 k • f (y) , where the first inequality is by Claim 2, and the second follows from the definition of f and dividing both sides by g(x) • g(y). Hence, max x∈X |f (x)| ≤ 2 k min x∈X |f (x)|. Case (iii): g decreases in the range of X. Let x, y ∈ X such that x < y.
By the assumptions (and Claim 2) |1 + x • poly(c ′ , x)| ≤ 2 k • |1 + y • poly(c ′ , y)|, so f (y) ≤ 2 k • f (x) , where the first inequality is by Claim 2, and the second follows from the definition of f and dividing both sides by g(x) • g(y). Hence, max x∈X |f (x)| ≤ 2 k min x∈X |f (x)|. Using Corollary 3 we obtain the following bound for the RFF sensitivity. Lemma 10. Let q 1 = (c 1 , c ′ 1 ) ∈ R k 2 . Let D ⊆ R denote the extrema of the function g that maps every x ∈ R to g(x) = |1 + x • poly(c ′ 1 , x)|. Let X ⊂ R such that for every x ∈ [min(X), max(X)] we have that min γ∈D |x -γ| ≥ max(X) -min(X). Let q 2 ∈ R k 2 be 2 k -bounded over X; see Definition 5. Let s : X → [0, ∞) be the sensitivity bound computed in Corollary 2 after substituting k with 2k + 1 and Y by X. For every x ∈ X we have |ratio(q 1 , x) - ratio(q 2 , x)| / y∈X |ratio(q 1 , y) - ratio(q 2 , y)| ≤ 4 k • s(x). Proof. If |X| = 1, by the construction of s in Corollary 2 we have that the single value x ∈ X satisfies s(x) ≥ 1, which yields that the inequality in the lemma holds for this case. Therefore, from now on we assume that this is not the case, i.e., we assume that |X| > 1. Identify q 2 = (c 2 , c ′ 2 ), and let c ∈ R 2k+1 such that for every x ∈ R we have poly(c, x) = poly(c 1 , x) • 1 + x • poly(c ′ 2 , x) - poly(c 2 , x) • 1 + x • poly(c ′ 1 , x) . Let f 1 , f 2 : R → [0, ∞) denote the functions that map every y ∈ X to f 1 (y) = 1 / (1 + y • poly(c ′ 1 , y)) and f 2 (y) = 1 / (1 + y • poly(c ′ 2 , y)) , respectively. Let x ∈ X. We have ratio(q 1 , x) - ratio(q 2 , x) = poly(c 1 , x) / (1 + x • poly(c ′ 1 , x)) - poly(c 2 , x) / (1 + x • poly(c ′ 2 , x)) = poly(c, x) / (1 + x • poly(c ′ 1 , x)) • (1 + x • poly(c ′ 2 , x)) = poly(c, x) • f 1 (x) • f 2 (x) , where the second equality is by assigning the definition of c, and the third equality is by assigning the definition of f 1 and f 2 .
Substituting X by [min(X), max(X)] and c ′ by c ′ 1 in Corollary 3 yields that the function f : X → R that maps every x ∈ X to f (x) = 1 / (1 + x • poly(c ′ 1 , x)) is well defined, and satisfies max x∈X |f (x)| / min x∈X |f (x)| ≤ 2 k . That is, since q 1 = (c 1 , c ′ 1 ), q 1 is 2 k -bounded over X; see Definition 5. Hence, |ratio(q 1 , x) - ratio(q 2 , x)| / y∈X |ratio(q 1 , y) - ratio(q 2 , y)| = |poly(c, x) • f 1 (x) • f 2 (x)| / y∈X |poly(c, y) • f 1 (y) • f 2 (y)| (12) ≤ max y∈X |f 1 (y)| / min y∈X |f 1 (y)| • max y∈X |f 2 (y)| / min y∈X |f 2 (y)| • |poly(c, x)| / y∈X |poly(c, y)| (13) ≤ 4 k • |poly(c, x)| / y∈X |poly(c, y)| (14) ≤ 4 k • s(x), (15) where Equation 12 is by Equation 11, Equation 13 follows from assigning that x ∈ X, Equation 14 holds since q 1 and q 2 are 2 k -bounded over X, and Equation 15 is by assigning the definition of s in the lemma.

E.3 CORRECTNESS OF ALGORITHM 4; PROOF OF LEMMA 3

In the following lemma we show a minor result that was used in Algorithm 4. Lemma 11. Let c ∈ R k and X = {a, a + 1, • • • , b} ⊂ [n] be a non-empty interval of [n]. Let f : R → [0, ∞) be the function that maps every x ∈ R to f (x) = |1 + x • poly(c, x)|. There is a partition {X 1 , • • • , X η } of X into η ≤ 2k -1 sets, such that for every i ∈ [η] the function f is monotonic over min(X i ), max(X i ) , and for every i, j ∈ [η], where i ̸ = j, we have X i ∩ min(X j ), max(X j ) = ∅. Moreover, this partition can be computed in (k + 1) O(1) • |X| time. Proof. Let g : R → R be the function that maps every x ∈ R to g(x) = 1 + x • poly(c, x), which is a polynomial of degree at most k. Observe that, by the fundamental theorem of algebra, any non-zero polynomial (due to its construction g is non-zero) of degree at most k has at most k roots. Thus, since the derivative of g is a polynomial of degree at most k -1, g has at most k -1 extrema. Hence, as any extremum of f is either a root or an extremum of g, f has at most 2k -1 extrema. Partitioning X according to the 2k -1 extrema of f yields the partition {X 1 , • • • , X η } from the lemma. Running time: Observe that in the root finding presented above (including in the computation of the extrema of g) it suffices to only search for roots in the range of X, and for the non-integer roots only for which integer a they are in (a, a + 1). Each integer candidate for the roots can be validated by a simple assignment in the polynomial in (k + 1) O(1) time. Lemma 5 bounds the VC-dimension that corresponds to the function D(q, p i ) over the points p 1 , • • • , p n in an n-signal projected onto its bicriteria; see Definition 2 and Definition 3. We now give a similar bound for the distance function D(q, (x, y)) = |ratio(q, x) -y| between every rational function q and the point p i = (x, y), where the points are a general set of points in the plane. The following lemma, similarly to Lemma 5, is inspired by Theorem 12 in Lucic et al. (2017).
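The extrema-based partition of Lemma 11 can be sketched with numpy's polynomial root finder: the candidate extrema of f = |g| are exactly the real roots of g and of its derivative, so cutting the integer interval at those roots leaves f monotonic on every piece. The function name and tolerance below are ours, not from the paper's code:

```python
import numpy as np

def partition_by_extrema(c, a, b):
    """Partition the integer interval {a, ..., b} at the extrema of
    f(x) = |1 + x * poly(c, x)|, so that f is monotonic on every piece
    (a sketch of the partition in Lemma 11; c holds the k coefficients)."""
    g = np.polynomial.Polynomial([1.0] + list(c))   # g(x) = 1 + x * poly(c, x)
    crit = []
    for p in (g, g.deriv()):                        # roots of g and of g'
        crit += [z.real for z in p.roots()
                 if abs(z.imag) < 1e-9 and a < z.real < b]
    cuts = sorted(set(int(np.floor(z)) for z in crit))
    pieces, start = [], a
    for cpt in cuts:
        if start <= cpt:
            pieces.append(list(range(start, cpt + 1)))
            start = cpt + 1
    if start <= b:
        pieces.append(list(range(start, b + 1)))
    return pieces
```

For example, with k = 1 and c = (-0.25), g has its single root at x = 4, so {1, ..., 8} splits into {1, ..., 4} and {5, ..., 8}.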
Lemma 12. Let P = {(1, y 1 ), • • • , (n, y n )} be an n-signal. For every p i = (i, y i ) ∈ P and any q ∈ R k 2 let g i (q) = D(q, p i ) = |ratio(q, i) -y i |; see Definition 2. Let G = {g 1 , . . . , g n }. The dimension of the range space R (R k ) 2 ,G that is induced by R k 2 and G is in O(k 2 ). Proof. For every (q, r) = (c, c ′ ), r ∈ R k 2 × R, let h (c|c ′ |r) : R → {0, 1} be the function that maps every i ∈ [n] to h (c|c ′ |r) (i) = 1 if and only if g i (q) ≤ r, and every x ∈ R \ [n] to h (c|c ′ |r) (x) = 0. Let H = {h θ | θ ∈ R 2k+1 }. For every c ∈ R k and any x ∈ R we can compute poly(c, x) with O(k) arithmetic operations on real numbers and jumps conditioned on comparisons of real numbers; see, for example, Horner's scheme Neumaier (2001), which is used in numpy's implementation of the method polyval Harris et al. (2020). Therefore, for every x ∈ R and any θ ∈ R 2k+1 , by the definition of D, we can calculate h θ (x) with O(k) arithmetic operations on real numbers and jumps conditioned on comparisons of real numbers. Hence, substituting d := n, m := 2k + 1, h := h, H := H and t ∈ O(k) in Theorem 2 yields that the VC-dimension of H is in O(k 2 ). Hence, by the construction of H and the definition of range spaces in Definition 7, we have that the dimension of the range space R (R k ) 2 ,G that is induced by R k 2 and G is in O(k 2 ). This, combined with the previous results, yields the following restricted coreset construction that utilizes the previous reduction of the RFF sensitivity to the polynomial sensitivity. Lemma 13. Let P be an interval of an n-signal which is projected onto some q ∈ R k 2 , i.e., ℓ(P, q) = 0. Let B := {(P, q)}, which is a (0, 1)-approximation B of P ; see Definition 3. Let X be the first coordinates of P , i.e., X := {x | (x, y) ∈ P }. Put ϵ, δ ∈ (0, 1/10], and let λ ≥ (c * / ϵ 2 ) • (4 k+1 k 2 + 1) k 2 log(4 k+1 k 2 + 1) + log (k log n / δ) , be an integer, where c * ≥ 1 is a constant that can be determined from the proof.
Let (S, w) be the weighted set that is returned by a call to MINI-REDUCE(B, λ); see Algorithm 4. Then |S| ∈ O (kλ • log n) and, with probability at least 1 -δ, for every q ′ ∈ R k 2 that is 2 k -bounded over X (see Definition 5), we have |ℓ(P, q ′ ) -ℓ ((S, w), q ′ )| ≤ ϵ • ℓ(P, q ′ ). Proof. By the construction of Algorithm 4 we have that |S| ∈ O (kλ • log n), which follows from the bounds on the order of η and every m i stated in Algorithm 4. Let δ ′ := δ/⌈c * 2 (k log n)⌉ for a constant c * 2 > 0 that can be determined from the proof, more specifically see Equation 19. Consider the set X j i ⊂ X, for i ∈ [η] and j ∈ [m i ], that was constructed during the execution of the i-th iteration of the outer "for" loop and the j-th iteration of the inner "for" loop in the call to MINI-REDUCE(B, λ). Let (c, c ′ ) := q, and let D ⊆ R denote the extrema of the function f that maps every x ∈ R to f (x) = |1 + x • poly(c ′ , x)|. By the construction of Algorithm 4, the size or diameter of X j i is smaller than its distance from the edges of X i , i.e., |X j i | ≤ min |x -max(X j i )|, |x -min(X j i )| for every extremum x ∈ D. Since X i has no extreme points, this implies that for every x ∈ X j i we have min γ∈D |x -γ| ≥ |X j i | = max(X j i ) -min(X j i ). Hence, assigning X := X j i , q 1 := q, q 2 := q ′ , and the function s : X j i → [0, ∞) computed in the call to MINI-REDUCE(B, λ) for X j i in Lemma 10 yields for every x ∈ X j i that |ratio(q, x) - ratio(q ′ , x)| / y∈X j i |ratio(q, y) - ratio(q ′ , y)| ≤ 4 k • s(x). Let P j i and S j i be as defined in Line 10 and Line 11, respectively, during the execution of the call to MINI-REDUCE(B, λ), and let w j i : S j i → [0, ∞) such that for every p ∈ S j i we have w j i (p) = w(p). Let Q denote the union over every q ′ ∈ R k 2 that is 2 k -bounded over X; see Definition 5. We now prove that S j i is an ϵ-subset-coreset for the query space P j i , Q, D, ∥ • ∥ 1 . Indeed, the corresponding dimension of R Q,G , where G is as defined in Lemma 12, is in O(k 2 ).
The sensitivity of every (x, y) := p ∈ P j i is s * (x) = max q∈Q D(q, p)/ℓ(P j i , q) ≤ 4 k s(x), where the inequality is by Equation 17. The total sensitivity is thus t = x∈X j i 4 k s(x) ∈ 4 k • O(k 2 ) , where the inequality is by the definition of s computed in the call to MINI-REDUCE(B, λ). Thus, substituting δ := δ ′ and the query space P j i , Q, D, ∥ • ∥ 1 in Theorem 3, combined with the construction of Algorithm 4, yields that, with probability at least 1 -δ ′ , (S j i , w j i ) is an ϵ-subset-coreset for the query space P j i , Q, D, ∥ • ∥ 1 . That is, with probability at least 1 -δ ′ , for every q ′ ∈ R k 2 we have ℓ(P j i , q ′ ) -ℓ (S j i , w j i ), q ′ ≤ ϵ • ℓ(P j i , q ′ ). Taking the union over every i ∈ [η] and j ∈ [m i ], under the assumption that Equation 18 holds for each pair, yields ℓ(P, q ′ ) -ℓ (S, w), q ′ = η i=1 mi j=1 ℓ P (j) i , q ′ -ℓ S (j) i , w (j) i , q ′ ≤ η i=1 mi j=1 ℓ P (j) i , q ′ -ℓ S (j) i , w (j) i , q ′ ≤ η i=1 mi j=1 ϵ • ℓ P (j) i , q ′ = ϵ • ℓ(P, q ′ ), where the first equality is by the construction of the weighted set (S, w) and the partition P (j) i of P in Algorithm 4, the first inequality is by the triangle inequality, the second inequality is by assigning Equation 18 (which was assumed to hold for all the values), and the last equality follows since P (j) i computed in Algorithm 4 is a partition of P . Since i ∈ [η] by Line 5 and j ∈ [m i ] by Line 7 of Algorithm 4, we have at most O(k log n) sets P (j) i ; this follows by assigning the bounds on the order of η and every m i from Algorithm 4. By the union bound, Equation 18 holds simultaneously for every i ∈ [η] and j ∈ [m i ] with probability at least 1 -δ ′ • η i=1 m i ≥ 1 -δ, which holds for δ ′ := δ/⌈c * 2 (k log n)⌉ for some constant c * 2 > 0. Hence, with probability at least 1 -δ we have |ℓ(P, q ′ ) -ℓ ((S, w), q ′ )| ≤ ϵ • ℓ(P, q ′ ) as stated in Equation 16, which proves the lemma.
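The sampling step behind Theorem 3 (applied to each piece in Algorithm 4) is standard sensitivity sampling; a minimal sketch under our own naming, not the paper's code: draw points with probability proportional to their sensitivity bounds, and reweight so that the weighted sample is an unbiased estimator of any sum of per-point losses.

```python
import numpy as np

def sensitivity_sample(points, sens, m, seed=0):
    """Sensitivity sampling: pick m points i.i.d. with probability
    proportional to the sensitivity bound sens[i], and weight each pick by
    t / (m * sens[i]) so the weighted sample estimates the original loss."""
    sens = np.asarray(sens, dtype=float)
    t = sens.sum()                       # total sensitivity
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=m, p=sens / t)
    weights = t / (m * sens[idx])
    return [points[i] for i in idx], weights
```

With uniform sensitivities every weight equals n/m, so the sample's total weight is exactly n; in general, tighter sensitivity bounds concentrate the sample on the hard-to-approximate points.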

F COMBINING (α, β)-APPROXIMATIONS; ANALYSIS OF ALGORITHM 3

The input for the algorithm is P , an interval of an n-signal which is projected onto some set of (α, β)-approximations; see Definition 3. This projection is represented by the set B, where each element B i ∈ B is a (0, β)-approximation for some P i , and {P 1 , • • • , P |B| } is a consecutive partition of P . The algorithm returns B ′ , a bicriteria approximation of P as in Definition 3, where the size of B ′ is smaller than |B| i=1 |B i |. The algorithm runs in O(|P | 1+ϵ ) time, for every constant ϵ > 0. The desired properties of Algorithm 3 are stated and proved in Lemma 14. For the sake of the analysis, we will use the following corollary. Corollary 4. Let {R 1 , • • • , R β } be a set of β ≥ 6k equally-sized distinct intervals of [n] such that for every i, j ∈ [β], i ̸ = j, we have R i ∩ min(R j ), max(R j ) = ∅. For every q ∈ R k 2 , there is C ⊂ [β], |C| = β -6k + 3 such that q is 2 k -bounded over R i , for every i ∈ C. Proof. Let q = (c, c ′ ), and let G be the set of the extrema of the function f : R → R that maps every x ∈ R to f (x) = |1 + x • poly(c ′ , x)|. Let r = |R 1 | = • • • = |R β |. W.l.o.g. assume that for every i ∈ [β -1] we have min(R i+1 ) > max(R i ); i.e., the sets are ordered in an increasing order. Let R 0 = {min(R 1 ) -r, • • • , min(R 1 ) -1} and R β+1 = {max(R β ) + 1, • • • , max(R β ) + 1 + r}. By the proof of Lemma 11, f has at most 2k -1 extrema. Therefore, removing from [β] every index i ∈ [β] such that f has an extremum in the range of either of {R i-1 , R i , R i+1 } yields a set C ⊂ [β], |C| ≥ β -6k + 3 such that for every i ∈ C we have ∀x ∈ R i : min g∈G |x -g| ≥ r = max(R i ) -min(R i ). For every i ∈ C, substituting c ′ , G and X := R i in Corollary 3 yields that the function g : X → R that maps every x ∈ R i to g(x) = 1 / (1 + x • poly(c ′ , x)) is well defined, and satisfies max x∈X |g(x)| / min x∈X |g(x)| ≤ 2 k . Thus, q is 2 k -bounded over R i for every i ∈ C; see Definition 5. Hence, C satisfies the corollary.
Combining a removal inspired by Rosman et al. (2014) , with the coreset construction from the previous section, yields the following lemma. Lemma 14. Let B := {B 1 , • • • , B β }, where each B i ∈ B is an (0, r i )-approximation of P i , i.e. P i is projected unto B i , and {P 1 , • • • , P β } is an equally-sized consecutive partition of P , some interval of an n-signal; see Figure 6 and Definition 3. Put ϵ, δ ∈ (0, 1/10], and let λ := c * ϵ 2 (4 k+1 k 2 + 1) k 2 log 2 (4 k+1 k 2 + 1) + log 2 kn δ be an integer, where c * ≥ 1 is a constant that can be determined from the proof. Let B ′ be the output of REDUCE(B, λ, 6k -3); see Algorithm 3. With probability at least 1 -δ, we have that B ′ is a (1 + 10ϵ, β * )-approximation of P for some β * ≥ 1; see Definition 3. Moreover, for ϵ = 1/10 we have that the running time of the call to REDUCE(B, λ, 6k -3) is in k) , where |P | • β 6k-3 • 4 O(k 2 ) (log (n/δ)β ′ ) O( β ′ = β i=1 |B i |. Proof. If β ≤ 12k -6, by the construction of B ′ in Algorithm 3 in Lines 1 and 20, we have B ′ := β i=1 B i . Since B i is a (0, r i )-approximation of P i for every i ∈ [β] and {P 1 , • • • , P β } is a partition of P , B ′ is an (0, |B ′ |) -approximation of P which yields that the lemma trivially holds. Thus, from now on, we assume that this is not the case. For every B i ∈ B identify B (1) i , • • • , B i := B i . Let q ∈ arg min q∈(R k ) 2 ℓ(P, q). For ev- ery i ∈ [β], let R i := {x | (x, y) ∈ P i }, i.e . the first coordinates of the points in P i . Since {P 1 , • • • , P β } is an equally-sized consecutive partition of some interval of an n-signal P , we have that {R 1 , • • • , R β } is a set of equally-sized partitions of [n] such that for every i, j ∈ [β], i ̸ = j we have R i ∩ min(R j ), max(R j ) = ∅. By Corollary 4, there is G ⊂ [β], |G| = β -6k + 3 such that q is 2 k -bounded over R i for every i ∈ G. Consider the integration of the "for" loop in the call to Algorithm 3 where G is computed, i.e., G := G. 
Let G ′ , S G , w G , and q G be defined as in the "for" loop iteration in the call to Algorithm 3. Let P G and P G ′ be the union of P i over every i ∈ G and i ∈ G ′ , respectively. Let P G\G ′ denote P G ′ ∪ P \ P G . The set B (j) i | i ∈ G ′ ∪ [β] \ G , B (j) i ∈ B i is a 0, β i=1 |B i | -approximation for P G\G ′ . Hence, by the construction of B ′ in Line 20 of Algorithm 3 it is left to prove that ℓ(P G\G ′ , q G ) ≤ (1 + 10ϵ) • ℓ(P G , q) ≤ (1 + 10ϵ) • ℓ(P, q). The last inequality trivially holds since P G ⊆ P . It is left to prove the first inequality of Eq. Equation 20. By Corollary 4, where we substitute β by |C|, there is a set G * ⊂ G of size |G * | = |G| -6k + 3 such that q G is 2 k -bounded over R i for every i ∈ G * . For every i ∈ G and any B (j) i ∈ B i identify P (j) i , q (j) i := B (j) i . For every i ∈ G and any i , δ = δ/(2n), q ′ = q G , and X := x | (x, y) ∈ P i , q G -ℓ S (j) B (j) i ∈ B i let S (j) i , w i , w (j) i , q G ≤ ϵ • ℓ P (j) i , q G . ( ) Let S * := i∈G\G * , B (j) i ∈ B i S (j) i , and w * : S * → [0, ∞) be the function that maps every p ∈ S * to w * (p) = w G (p). Applying Equation 21for every i ∈ G * and j ∈ [|B i |], along with utilizing the union bound, yields that, with probability at least 1 -δ/(2n) n ≥ 1 -δ/2, we have ℓ(P G\G * , q G ) -ℓ (S * , w * ), q G = i∈G * |Bi| j=1 ℓ P (j) i , q G -ℓ S (j) i , w (j) i , q G (22) ≤ i∈G * |Bi| j=1 ℓ P (j) i , q G -ℓ S (j) i , w (j) i , q G (23) ≤ ϵ • ℓ(P G\G * , q G ), where Equation 22is by the constructions of P G * and S * , Equation 23 is by the triangle inequality, and Equation 24 is by Equation 21. Substituting P = P (j) i , δ = δ/(2n), q ′ = q, and X := x | (x, y) ∈ P (j) i (the first coordinate of P (j) i ) in Lemma 13 for every i ∈ G and B (j) i ∈ B i (by the choice of G, q is 2 k -bounded over R i , which contains the first coordinate of P (j) i ) yields that with probability at least 1 -δ 2n we have ℓ P (j) i , q -ℓ S j i , w j i , q ≤ ϵ • ℓ P (j) i , q . 
Applying Equation 25for every i ∈ G and B (j) i ∈ B i , along with utilizing the union bound, yields that, with probability at least 1 -δ/(2n) n ≥ 1 -δ/2, we have ℓ P G , q -ℓ (S G , w), q = i∈G |Bi| j=1 ℓ P (j) i , q -ℓ S (j) i , w (j) i , q ≤ i∈G |Bi| j=1 ℓ P (j) i , q -ℓ S (j) i , w (j) i , q ≤ ϵ • ℓ P G , q , where Equation 26is by the constructions of the P G and S G in Algorithm 3, Equation 27is by the triangle inequality, and Equation 28 is by Equation 21. If both Equation 22to Equation 24and Equation 26 to Equation 28 holds, which happens with probability at least 1 -δ, we have ℓ(P G\G ′ , q G ) ≤ ℓ(P G\G * , q G ) (29) ≤ 1 1 -ϵ • ℓ (S * , w * ), q G (30) ≤ 1 1 -ϵ • ℓ (S, w), q G (31) ≤ 1 1 -ϵ • ℓ (S, w), q * (32) ≤ 1 + ϵ 1 -ϵ • ℓ(P G , q * ) (33) ≤ (1 + 10ϵ) • ℓ(P G , q * ), where Equation 29is by the choice of G ′ in Line 16 of Algorithm 3, Equation 30 is by Equation 22to Equation 24, Equation 31 is since S * ⊂ S, Equation 32 is since q ∈ arg min q ′ ∈(R k ) 2 ℓ (S, w), q ′ (by the construction of Algorithm 3), Equation 33is by Equation 26 to Equation 28, and Equation 34 holds since ϵ ≤ 1/2.

Running time:

Let β ′ := ∈ O β 6k-3 iterations of the "for" loop we obtain a total running time of |P | • β 6k-3 • 4 O(k 2 ) (log (n/δ)β ′ ) O(k) . G ALGORITHM 1: STREAMING In this section we prove that Algorithm 1, which gets P an n-signal of length n ≥ 2k is a power of 2, and ϵ, δ ∈ (0, 1/10], returns an ϵ-approximation of P with failure probability at most δ. The formal statement and its proof is given in Theorem 5. For the analysis we will use the following corollary, which is inspired by Lemma 3.6 in Braverman et al. (2020) and proves that projecting a dataset on a corresponding approximation for some function yields a coreset for this function. Corollary 5. Let P be an n-signal. Let B := {(P 1 , q 1 ), • • • , (P β , q β )} be an (α, β)-approximation of P ; see Definition 3. For every i ∈ [β] let P * i := x, ratio(q i , x) | (x, y) ∈ P i , and let P * := β i=1 P * i ; i.e., P * is the projection of P onto B. Let q = (c, c ′ ) ∈ R k 2 such that 1 + x • poly(c ′ , x) ̸ = 0 for every (x, •) ∈ P . Then ℓ(P * , q) ≤ (1 + α) • ℓ(P, q). Proof. Let q be defined in the corollary. We have ℓ(P * , q) = β i=1 (x,y)∈Pi ratio(q, x) -ratio(q i , x) ≤ β i=1 (x,y)∈Pi ratio(q i , x) -y + ratio(q, x) -y (36) = β i=1 p∈Pi D(q i , p) + p∈P D(q, p) ≤ (1 + α) • ℓ(P, q), where Equation 35is by assigning the definition of D and the definitions from the corollary, Equation 36 is by the triangle inequality, Equation 37 by the definition of D, and Equation 38 is by the definitions from the corollary. At first glance, it may seem that there would be a significant problem using the previously discussed variant of the Merge-Reduce scheme presented in Braverman et al. (2020) . Indeed, using the classic version where combining each pair of coresets would yield that the guarantee on the final approximation would be that the cost of the up to a polynomial in the data-set's size factor from the optimal solution (hence, the sample in SAMPLE-CORESET would be larger than the original dataset). 
This can be fixed by combining a significantly larger number of nodes, which by the following observation would yield that height of the tree would be ⌈log log n⌉, and as a consequence that the final approximation would be poly-logarithmic in the dataset size. The following theorem and Algorithm 1 are the main result of this work. Theorem 5. Let P be an n-signal, for n that is a power of 2, and put ϵ, δ ∈ (0, 1/10]. Let (B, C, w) be the output of a call to CORESET(P, k, ϵ, δ); see Algorithm 1. With probability at least 1 -δ, (B, C, w) is an ϵ-coreset of P ; see Definition 4. Moreover, the computation time of (B, C, w) is in 2 O(k 2 ) • n • n O(k)/ log log(n) • log(n) O(k log(k)) • log(1/δ) O(k) , and the memory words required to store (B, C, w) are in (2k) O(1) • log(n) O(1)+log(k) • log(1/δ)/ϵ 2 . Which, when considering k and log(1/δ) as constants yields that the running time is in O n 1+o(1) and the space is in O n o(1) /ϵ 2 . Proof. Let β := n 1/ log log(n) and β = ⌈n/β⌉ be the values that are defined in the call to CORESET(P, k, ϵ, δ) . Let (B 1 , • • • , B β ′ ) be the output of the call to BATCH-APPROX(P, β) in Line 4 of Algorithm 1; see Algorithm 5. Let i ∈ [β ′ ] and identify ( Pi , q i ) := B i . Let P i := (x, y) | (x, •) ∈ Pi , i.e., {P 1 , • • • , P ψ } is the partition computed in the call to BATCH-APPROX(P, β). By the construction B i in Line 3 of Algorithm 5 we have q i ∈ arg min q∈(R k ) 2 ℓ(P i , q). Let Q be the union over the pairs (c, c ′ ) ∈ R k 2 such that 1 + x • poly(c ′ , x) ̸ = 0 for every (x, •) ∈ P i . Hence, for every q ∈ Q by assigning P := P i , B := {(P, q)} , P * := Pi and α, β = 1 in Corollary 5, we obtain ℓ( Pi , q) ≤ 2 • ℓ(P i , q). ( ) Let B ′ 1 , • • • , B ′ ψ be a set of biciterias, as defined as in Line 6 during the first iteration of the "while" loop in the call to Algorithm 1; see Definition 3. Let i ∈ [ψ], and P ′ := (Y,q ′ )∈B∈B ′ i Y , i.e., the union of all the sets of points in the bicritrias in B ′ i . 
Let B i := REDUCE (B ′ i , λ 1 ) as computed in Algorithm 1. Identify B i = ( P1 , q 1 ), • • • , ( Pr , q r ) , and put P ′ a := (x, y) ∈ P ′ | (x, •) ∈ Pa , for every a ∈ [r], i.e. every P ′ a is the set of points in P ′ that are approximated by q a . Replacing ϵ with 1/10, B with B ′ i and δ with δ/(4n) in Lemma 2 yields that there is λ 1 as defined in Line x of Algorithm 1 such that B i with probability at least 1 -δ/(4n) is an (2, |B i |)-approximation to P ′ . That is, with probability at least 1 -δ/(4n) we have r a=1 ℓ(P ′ a , q a ) ≤ 2 min q∈(R k ) 2 r a=1 ℓ(P ′ a , q). For every B a ∈ B ′ i , let P a = {(x, y) ∈ P | x ∈ P ′ a }, i.e., the points in P which are approximated by the biciteria B a ; see Definition 3. Plugging Equation 39 in the right side of Equation 40 yields that, with probability at least 1 -δ/(4n), we have By the construction of Algorithm 1 it trivially holds there are ⌈log log(n)⌉ -2 iterations of the "while" loop, and 2n calls to REDUCE. Hence, repeating the proof above recursively for every iteration of the "while" loop and the last call to REDUCE in Line 10 yields that with probability at least (1 -δ/(4n)) 2n ≥ 1 -δ/2 we have that B computed in Line 10 is an 3 ⌈log log(n)⌉ , β *approximation to P , for some integer β * ≥ 1. By the construction of Algorithms [3,1], and that there are ⌈log log(n)⌉ -2 iterations of the "while" loop we have β * ≤ (24k) ⌈log log(n)⌉ . Therefore, combining this with Lemma 1 yields that there is λ 2 as defined in the call to Algorithm 1, and proves the theorem. Space complexity. Let λ 1 as defined in Algorithm 1 (conditions were set in the previous part of the proof). From the previous section we have that B is a O log(n), log(n) O( 1)+log(k) -approximation of P . Hence, by the construction of Algorithm 2 and assigning λ 2 (conditions were set in the previous section) we have that (B, C, w) can be represented in (2k) O(1) •log(n) O(1)+k •log(1/δ)/ϵ 2 space, which proves the claim for memory size. 
Running time. Note that for every iteration of the "while" loop in the call to CORESET(P, k, ϵ, δ): • By Lemma 2 and the previous analysis, we have that each call to REDUCE(B ′ i , λ 1 ) in CORESET (P, k, ϵ, δ)  takes n i • β O(k) + (ββ * λ 1 ) O(k) , where n i = Bi∈B ′ i (Y,q ′ )∈Bi |Y |. • Put ψ ≤ n as the number of sets B ′ i computed in this iteration of the "while" loop. • Hence, since n = ψ i=1 n i , we have that the computation time of the iteration is in k) . ψ i=1 n ′ • β O(k) + (ββ * λ 1 ) O(k) = n • β O(k) + ψ(ββ * λ 1 ) O(k) ≤ n • β O(k) • (β * λ 1 ) O( By the computation of λ 1 , β in the call to CORESET(P, k, ϵ, δ), and that β * ∈ log(n) O( 1)+log(k) (from the previous sections) yields that the computation time of the iteration is in k) . 2 O(k 2 ) • n • n O(k)/ log log(n) • log(n) O(k log(k)) • log(1/δ) O( By the construction of Algorithm 1 there are ⌈log log(n)⌉ -2 iterations of the "while" loop. Hence, the while-loop takes k) time. Therefore, we have that the running time of the call to Algorithm 1 is in 2 O(k 2 ) • n • n O(k)/ log log(n) • log(n) O(k log(k)) • log(1/δ) O( 2 O(k 2 ) • n • n O(k)/ log log(n) • log(n) O(k log(k)) • log(1/δ) O(k) .
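The merge-and-reduce scheme behind Algorithm 1, with a large fanout so that the tree height stays small (the paper's choice of block size yields height ⌈log log n⌉ and hence a poly-logarithmic approximation blow-up), can be sketched generically as follows. All names are ours, and reduce_fn stands in for the paper's coreset-compression step:

```python
def merge_reduce(stream, reduce_fn, fanout, leaf_size):
    """Generic merge-and-reduce over a stream.  Buffer leaf_size items,
    compress each full buffer with reduce_fn, and whenever `fanout`
    compressed blocks of the same level accumulate, merge and compress
    them one level up.  The tree height is about
    log(n / leaf_size) / log(fanout), so a large fanout keeps it small."""
    levels = {}  # level -> list of compressed blocks awaiting a merge

    def push(level, block):
        blocks = levels.setdefault(level, [])
        blocks.append(block)
        if len(blocks) == fanout:
            levels[level] = []
            merged = [p for b in blocks for p in b]
            push(level + 1, reduce_fn(merged))

    buf = []
    for p in stream:
        buf.append(p)
        if len(buf) == leaf_size:
            push(0, reduce_fn(buf))
            buf = []
    if buf:
        push(0, reduce_fn(buf))
    # the sketch of the whole stream is the union of all remaining blocks
    return [p for blocks in levels.values() for b in blocks for p in b]
```

The error compounds once per level of the tree, which is why Algorithm 1 merges many blocks at a time instead of the classic pairwise merge.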

H FAST PRACTICAL HEURISTIC; ANALYSIS OF ALGORITHM 6

Unfortunately, since the slow solver relies on Pizzo et al. (2018), which utilizes polynomial programming and thus takes significant running time, the running time of our robust algorithms is still large. Therefore, we suggest a heuristic in Algorithm 6 to run on top of our coreset. In this section we prove that, under some assumptions, this heuristic gives a constant factor approximation. For the sake of the analysis of Algorithm 6, we prove the following result for fitting a hyperplane to a set of points.

H.1 FITTING PLANE TO POINTS

In Lemma 17 we will prove that a common heuristic for fitting hyperplanes to points does give approximation guarantees for points of bounded coordinates. This result will later be combined with some assumptions to give the desired approximation guarantees for Algorithm 6. Lemma 15. Let P = {(x 1 , y 1 ), • • • , (x n , y n )} be a set of n ≥ 1 points on the plane, where x i ̸ = 0 for every i ∈ [n], let a := 1 n n i=1 |x i |, and let k ≥ 1 be an integer such that ∀i ∈ [n] : a / k ≤ |x i | ≤ ka. (41) For every i ∈ [n] let c i := y i / x i , and let G := {c 1 , • • • , c n } be a multi-set of size n. Let c be a random item from G, sampled uniformly. Then, with probability at least 1/2 we have (xi,yi)∈P |x i • c -y i | ≤ 2k 2 + 1 • min c ′ ∈R (xi,yi)∈P |x i • c ′ -y i |. (42) Proof. Let c * ∈ arg min c ′ ∈R (xi,yi)∈P |x i • c ′ -y i |. By Markov's inequality, with probability at least 1/2 we have |c -c * | ≤ (2 / n) c ′ ∈G |c * -c ′ | := k ′ . (43) Suppose this event indeed occurs. For every (x, y) ∈ P and c ′ ∈ G such that xc ′ = y (there exists such c ′ by the definition of G in the lemma), we have |xc -y| = |xc -xc ′ | (44) = |x| • |c -c ′ | (45) = |x| • |c -c * -c ′ + c * | (46) ≤ |x| • (|c * -c| + |c ′ -c * |) (47) ≤ |x|k ′ + |x| • |c * -c ′ | (48) ≤ kk ′ a + |c * x -y|, (49) where Equation 44 is by the choice of c ′ ∈ G as a value satisfying xc ′ = y, Equation 45 and Equation 46 are by reorganizing the expression, Equation 47 is by the triangle inequality, Equation 48 is by the definition of k ′ from Equation 43, and Equation 49 is by assigning that |x| ≤ ka (from the definition of P in the lemma, substituting x i := x in Equation 41) and that xc ′ = y (from the choice of c ′ ∈ G).
Hence,
$$\sum_{(x,y) \in P} |xc - y| - \sum_{(x,y) \in P} |xc^* - y| \le nkk'a \quad (50)$$
$$= 2k \sum_{i=1}^n a |c_i - c^*| \quad (51)$$
$$= 2k \cdot \sum_{i=1}^n \frac{a |x_i c_i - x_i c^*|}{|x_i|} \quad (52)$$
$$= 2k \cdot \sum_{(x,y) \in P} \frac{a |y - x c^*|}{|x|} \quad (53)$$
$$\le 2k \cdot \sum_{(x,y) \in P} k |y - x c^*| \quad (54)$$
$$= 2k^2 \cdot \sum_{(x,y) \in P} |x c^* - y|, \quad (55)$$
where Equation 50 is by summing Equations 44 to 49 over every $(x, y) \in P$, Equation 51 is by assigning the definition of $k'$ from Equation 43, Equation 52 is by multiplying and dividing the expression inside the sum by $|x_i|$, Equation 53 follows from the definition of $c_i$ as $y_i / x_i$ in the lemma, Equation 54 is by the bound $|x_i| \ge a/k$ from Equation 41, and Equation 55 is by reorganizing the expression.

By assuming that similar assumptions hold for every dimension separately we can generalize the previous lemma to higher dimensions, as done in the following lemma.

Lemma 16. Let $P = \{(a_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$ be a set of $n \ge 1$ points, where $a_i = (a_i^{(1)}, \cdots, a_i^{(d)})$ and $a_i^{(j)} \ne 0$ for every $(i, j) \in [n] \times [d]$. For every $j \in [d]$ let $a^{(j)} := \frac{1}{n} \sum_{i=1}^n |a_i^{(j)}|$, and let $k \ge 1$ be an integer such that $\frac{a^{(j)}}{k} \le |a_i^{(j)}| \le k a^{(j)}$ for every $(i, j) \in [n] \times [d]$. For every $S \subset P$ of size $d$ let $c_S \in \mathbb{R}^d$ be such that $a^T \cdot c_S - y = 0$ for every $(a, y) \in S$; there is such a value by properties of linear regression. Let $G$ be the multi-set $\{c_S \mid S \subset P, |S| = d\}$, and let $c$ be a random item from $G$, sampled uniformly. Then, with probability at least $2^{-d}$ we have
$$\sum_{(a,y) \in P} |a^T \cdot c - y| \le (2k^2 + 1)^d \cdot \min_{c' \in \mathbb{R}^d} \sum_{(a,y) \in P} |a^T \cdot c' - y|. \quad (56)$$

Proof. Let $c' = (c'_1, \cdots, c'_d) \in \mathbb{R}^d$ and $j' \in [d]$. For every $i \in [n]$ let $x_i := a_i^{(j')}$, let $y'_i := y_i - \sum_{j \in [d] \setminus \{j'\}} a_i^{(j)} c'_j$, and let $c_i := \frac{y'_i}{x_i}$. Let $P' := \{(x_1, y'_1), \cdots, (x_n, y'_n)\}$ and $G' := \{c_1, \cdots, c_n\}$ be multi-sets, both of size $n$. Let $c$ be a random item from $G'$, sampled uniformly, and let $c^*$ be $c'$ with its $j'$-th entry replaced by $c$. By the definition of $P'$ and $c'$ we have
$$\sum_{(a,y) \in P} |a^T \cdot c^* - y| = \sum_{(x,y') \in P'} |x \cdot c - y'|. \quad (57)$$
From the definition of $P$ in the lemma, for every $(i, j) \in [n] \times [d]$ we have $\frac{a^{(j)}}{k} \le |a_i^{(j)}| \le k a^{(j)}$. Hence, by the construction of $P'$ we have $a^{(j')} = \frac{1}{n} \sum_{(x,y') \in P'} |x|$, and for every $(x, y') \in P'$ we have $\frac{a^{(j')}}{k} \le |x| \le k a^{(j')}$. Thus, by substituting $P := P'$ and $G := G'$ in Lemma 15, with probability at least $1/2$ we have
$$\sum_{(x,y') \in P'} |x \cdot c - y'| \le (2k^2 + 1) \cdot \sum_{(x,y') \in P'} |x \cdot c'_{j'} - y'|.$$
Combining this with Equation 57 yields that, with probability at least $1/2$,
$$\sum_{(a,y) \in P} |a^T \cdot c^* - y| \le (2k^2 + 1) \cdot \sum_{(x,y') \in P'} |x \cdot c'_{j'} - y'| = (2k^2 + 1) \cdot \sum_{(a,y) \in P} |a^T \cdot c' - y|,$$
where the equality is by the construction of $P'$.

Let $c' = (c'_1, \cdots, c'_d) \in \arg\min_{c' \in \mathbb{R}^d} \sum_{(a,y) \in P} |a^T c' - y|$, and let $(c_1, \cdots, c_d) := c$. For every $j \in [d]$ let $c^*_j = (c_1, \cdots, c_j, c'_{j+1}, \cdots, c'_d)$, and let $c^*_0 = c'$. By the argument above, for every $j \in [d]$, with probability at least $1/2$ we have
$$\sum_{(a,y) \in P} |a^T \cdot c^*_j - y| \le (2k^2 + 1) \cdot \sum_{(a,y) \in P} |a^T \cdot c^*_{j-1} - y|.$$
Assigning $c^*_0 = c'$ and $c^*_d = c$ in the above, combined with the construction of $G$ in the lemma, yields Equation 56.

By repeatedly taking samples from the set $G$ in the lemma above we obtain the following lemma, which gives the desired approximation guarantees for linear regression under assumptions.

Lemma 17. Let $P = \{(a_i, b_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$ be a set of $n \ge 1$ points such that there is an integer $k \ge 1$ with
$$\frac{1}{nk} \sum_{i'=1}^n |a_{i'}^{(j)}| \le |a_i^{(j)}| \le \frac{k}{n} \sum_{i'=1}^n |a_{i'}^{(j)}| \quad \text{for every } (i, j) \in [n] \times [d].$$
For every $S \subset P$ of size $d$ let $c_S \in \mathbb{R}^d$ be such that $a^T \cdot c_S - y = 0$ for every $(a, y) \in S$; there is such a value by properties of linear regression. Let $G$ be the multi-set $\{c_S \mid S \subset P, |S| = d\}$. Let $\delta \in (0, 1)$, $\epsilon = 2^{-d}$, and let $\lambda := \max\{\lceil \log_{1-\epsilon}(\delta) \rceil, 1\}$. Let $S \subset G$, $|S| = \lambda$, where each value is sampled i.i.d. and uniformly. Then, with probability at least $1 - \delta$ there is $c \in S$ such that
$$\sum_{(x,y) \in P} |x^T c - y| \le (2k^2 + 1)^d \cdot \min_{c' \in \mathbb{R}^d} \sum_{(x,y) \in P} |x^T c' - y|. \quad (59)$$

Proof. By Lemma 16, a uniformly sampled $c \in G$ satisfies, with probability at most $1 - \epsilon$,
$$\sum_{(x,y) \in P} |x^T c - y| > (2k^2 + 1)^d \cdot \min_{c' \in \mathbb{R}^d} \sum_{(x,y) \in P} |x^T c' - y|. \quad (60)$$
Hence, by the independence of the samples, the probability that Equation 60 holds for every $c \in S$ is at most $(1 - \epsilon)^\lambda \le (1 - \epsilon)^{\log_{1-\epsilon} \delta} = \delta$, where the inequality is since $\lambda \ge \log_{1-\epsilon} \delta$.
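The candidate-sampling heuristic analyzed in Lemma 17 can be sketched as follows. This is our own minimal illustration, not the paper's implementation: we enumerate all d-subsets instead of sampling λ of them, use noiseless synthetic data with a hypothetical ground truth c = (2, 3), and solve each subset with a generic Gauss-Jordan helper `solve_dxd`.

```python
from itertools import combinations

def solve_dxd(A, b):
    """Solve a small d x d linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ZeroDivisionError("degenerate subset")
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def l1_loss(points, c):
    """Sum of distances |a^T c - y| over all points (a, y)."""
    return sum(abs(sum(aj * cj for aj, cj in zip(a, c)) - y) for a, y in points)

# Noiseless data from a hypothetical ground truth c = (2, 3); here d = 2.
true_c = (2.0, 3.0)
xs = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.2), (0.8, 2.5)]
points = [(a, true_c[0] * a[0] + true_c[1] * a[1]) for a in xs]

# Candidate set G: the exact fit c_S of every subset S of size d (Lemma 17 samples from G).
candidates = []
for S in combinations(points, 2):
    try:
        candidates.append(solve_dxd([a for a, _ in S], [y for _, y in S]))
    except ZeroDivisionError:
        continue  # singular subset, skip

best = min(candidates, key=lambda c: l1_loss(points, c))
```

On noiseless data every non-degenerate d-subset recovers the ground truth exactly, so the best candidate has zero ℓ1 loss; with noise, Lemma 17 bounds the loss of the best sampled candidate by a $(2k^2+1)^d$ factor with probability at least $1 - \delta$.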

H.2 SOLVER COMPUTATION; SEE DEFINITION 6

In this section we prove, for $S \subset \mathbb{R}^2$ of size $2k$ satisfying $|\{x \cdot y \mid (x, y) \in S\}| = 2k$, that SOLVER($S$) is never empty and that it can be computed in $O(k^3)$ time; see Definition 6. This method is a generalization of a technique shown in NIST/SEMATECH (2021). While it is plausible that there is previous work showing the robustness of a method equivalent to ours, this section is still important for the self-containment of the work. For this we state the following global definitions, which are used throughout this section.

Definition 12. Let $S = \{(x_1, y_1), (x_2, y_2), \cdots, (x_{2k}, y_{2k})\} \subset \mathbb{R}^2$, where $\{x \cdot y \mid (x, y) \in S\}$ and $\{x \mid (x, \cdot) \in S\}$ are both of size $2k$. Let $A_1, A_2 \in \mathbb{R}^{(2k) \times k}$ be defined as
$$A_1 = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{k-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{k-1} \\ \vdots & & & & \vdots \\ 1 & x_{2k} & x_{2k}^2 & \cdots & x_{2k}^{k-1} \end{pmatrix}, \qquad A_2 = \begin{pmatrix} -y_1 x_1 & -y_1 x_1^2 & \cdots & -y_1 x_1^k \\ -y_2 x_2 & -y_2 x_2^2 & \cdots & -y_2 x_2^k \\ \vdots & & & \vdots \\ -y_{2k} x_{2k} & -y_{2k} x_{2k}^2 & \cdots & -y_{2k} x_{2k}^k \end{pmatrix}.$$
Let $A = (A_1 \mid A_2)$ be a $(2k) \times (2k)$ matrix, and let $\tilde{x} = (x_1, \cdots, x_{2k})$ and $\tilde{y} = (y_1, \cdots, y_{2k})$.

In the following lemma we derive conditions such that any value satisfying them is a candidate output of SOLVER($S$).

Fig. 9 presents the dataset readings along with the approximation computed by FRFF-coreset, Scipy's rational function fitting computed via Scipy.optimize.minimize, and a 3rd degree polynomial computed using the numpy.polyfit function, which minimizes the sum of squared distances between the polynomial and the input. For a fair comparison, all three methods were allowed 4 free parameters. In particular, our rational function and Scipy's rational function have degree 1 in the numerator and 2 in the denominator (there are 4 free parameters, since the free variable in the denominator is set to 1), while the polynomial is of degree 3.
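As a concrete illustration of Definition 12 (and of Lemma 18 below), the following sketch builds $A = (A_1 \mid A_2)$ for $k = 2$ from $2k$ sample points and solves $A \cdot b = \tilde{y}$. The sample points, the ground-truth coefficients, and the ascending-order convention for poly are our own illustrative assumptions, and `gauss_solve` is a generic helper rather than the paper's solver.

```python
def gauss_solve(A, b):
    """Solve a small square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ZeroDivisionError("singular matrix")
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def poly(c, x):
    """Evaluate c[0] + c[1]*x + ... + c[-1]*x**(len(c)-1) via Horner's scheme."""
    r = 0.0
    for ci in reversed(c):
        r = r * x + ci
    return r

k = 2
true_c, true_cp = [1.0, 2.0], [0.5, 0.25]  # hypothetical ground truth (c, c')

def g(x):
    """Rational function of the assumed form poly(c, x) / (1 + x * poly(c', x))."""
    return poly(true_c, x) / (1.0 + x * poly(true_cp, x))

# 2k sample points with distinct x (and distinct x * y) values.
S = [(float(x), g(float(x))) for x in range(1, 2 * k + 1)]

# Row of A = (A_1 | A_2): [1, x, ..., x^{k-1}, -y*x, -y*x^2, ..., -y*x^k].
A = [[x ** j for j in range(k)] + [-y * x ** (j + 1) for j in range(k)] for x, y in S]
b = gauss_solve(A, [y for _, y in S])
c, cp = b[:k], b[k:]
```

By Lemma 18, any such solution with nonzero denominators interpolates all 2k points; in this noiseless sketch it also recovers the ground-truth coefficients.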
Observe that in the following examples the 3rd degree polynomial yielded a slightly smaller loss than our method; this is in contrast to the example in Fig. 4, where the 3rd degree polynomial yielded significantly worse results. We believe that this occurred because the "un-noisy" data corresponds to a "smoother" function, while in Fig. 4 the function was less smooth due to the data generation.

Figure 9: The n-signal corresponding to the TEMP and DEWP properties from the year 2013 of Dataset Chen (2019), along with three fitted functions: (i) the rational function of degree 2 (i.e., degree 1 in the numerator and 2 in the denominator) computed by our algorithm FRFF-coreset, (ii) the output of a call to Scipy.optimize.minimize that aims to fit a rational function of degree 2 to the input signal, and (iii) a polynomial of degree 3, computed using the numpy.polyfit function. For a fair comparison, all three methods use 4 free parameters.

In this section we include all the results computed for Chen (2019) in Section 3.2. As there is relatively little change in the computational time plot over the sample sizes and features, we show the mean results for the year 2016 for the temperature feature in the following table. In this test we use the dataset from Vito (2016), which contains a "yearly measurement of a gas multisensor device deployed on the field in an Italian city" (cited from Vito (2016)). Fig. 12 presents the dataset readings along with the approximation computed by FRFF-coreset, Scipy's rational function fitting computed via Scipy.optimize.minimize, and a polynomial of degree 3, computed using the numpy.polyfit function, which minimizes the sum of squared distances between the polynomial and the input. For a fair comparison, all three methods were allowed 4 free parameters.
In particular, our rational function and Scipy's rational function have degree 1 in the numerator and 2 in the denominator (there are 4 free parameters, since the free variable in the denominator is set to 1), while the polynomial is of degree 3. Observe (as in Fig. 9) that in the following examples the 3rd degree polynomial yielded a slightly smaller loss than our method; this is in contrast to the example in Fig. 4, where the 3rd degree polynomial yielded significantly worse results. We believe that this occurred because the "un-noisy" data corresponds to a "smoother" function, while in Fig. 4 the function was less smooth due to the data generation. Repeating the test of Section 3.2 for this dataset yields the following plots. Since there is relatively little change in the computational time over the features and sample sizes, we show the mean results in the following table. This figure demonstrates that rational function fitting is more suitable than polynomial fitting for a relatively normal dataset that is essentially a set of samples from an exponential function. It also shows that computing any of those fitting functions, either on the full data or on our coreset, produces similar results.

J.2 SECTION 3.1

In this section we demonstrated that our algorithms achieved the best accuracy consistently across the varying signal lengths, where RFF-coreset achieved the best performance, followed by FRFF-coreset. The main results are summarised in Fig. 2. While under Evaluation (i) RandomSample, which is essentially a random sample from the input, yielded good quality, under Evaluation (ii) its values were so large that they were clipped.
Informally, we believe that this occurred since the near-optimal query considered in Evaluation (i) is not "very far" from any point in the generated dataset, and as such no point in the data is "very important" for the near-optimal query. However, as presented in Observation 1, this is not the case for all queries or even for all datasets (consider a dataset where all the points lie on a query besides one point whose y-value approaches ∞; this single point would be "very important"). In this experiment we obtained a significant speed improvement compared to all the other methods besides RandomSample and Gradient, where the latter is based on the function Scipy.optimize.minimize from SciPy Virtanen et al. (2020). This is to be expected, since a random sample is obviously efficient, and the function Scipy.optimize.minimize (at least for our parameters, and to the best of our knowledge) uses optimization methods from Nocedal & Wright (2006) that in some cases can very quickly yield a local minimum (but not a global one). Hence, while RandomSample and Gradient had lower running times, as observed in Fig. 2, they had significantly worse results. Observe that while L∞-Coreset, which uses the guaranteed approximation for max deviation from Peiris et al. (2021), might seem a valid heuristic at first glance, our results in Fig. 2 demonstrate that while it was the closest contestant to our methods in terms of quality, this came at the price of a significantly larger running time (it was the only time plot that was clipped). While it is plausible that this follows from our improper parameter tuning, we believe that it comes from the use of linear programming solvers, for which, to the best of our knowledge at the time of writing, there is no solver with running time in $O(n^2)$; the lowest bound we are aware of is the one in Cohen et al. (2019).

J.3 SECTION I

In the experiments we obtained very similar results to the experiment in Section 3.1, which validates our observation above. We note that in some cases FRFF-coreset achieved better quality than RFF-coreset, where this mostly occurred for Evaluation (i). Informally, we believe that this occurred since the approximation in FRFF-coreset was very similar to the near-optimal query considered, while the BI-CRITERIA in RFF-coreset might have had a lower loss but was farther from the



Figure 1: Illustration for Algorithm 1. (Top) In Line 4, an input n-signal P (black ticks and red dots) and its partition {P_1, ..., P_16} into ψ = 16 sets via a call to Algorithm 5. This call also computes a (1, 1)-approximation B_i = {(P_i, Q_i)} for every P_i (green curves); see Definition 3. In Line 6, the set {B_1, ..., B_16} is partitioned into B = B'_1 ∪ B'_2 ∪ B'_3 ∪ B'_4, where each such set contains 4 elements from B. (Middle) In Line 7, for every i ∈ [4] we set B_i = P

onto B i is denoted in green. (Bottom) The process above is repeated, with B := {B 1 , B 2 , B 3 , B 4 }. B is therefore partitioned into sets of size β = 4, i.e., only 1 such set B, and then REDUCE(B) is called, which reduces the size of B to 5.

Figure 2: Results of the experiment from Section 3.1. (Left + Middle): The X-axis presents, on a logarithmic scale, the size of the input signal, and the Y -axis presents the approximation error of each compression scheme, for given compression sizes of 382 and 2824 respectively. The upper and lower rows present Evaluations (i) and (ii) respectively. (Right): The computational time for each of the two coreset sizes. The evaluation method does not affect those times.

Figure 3: Results of the experiment from Section 3.2. The X-axis presents the size of the compression as a percentage of the original data, and the Y-axis presents the approximation error of each compression scheme. The top and bottom rows present Evaluations (i) and (ii) respectively.

Figure 4: RFF illustration. A time-series f(x) = e^{x/512} over 1, ..., 2^12 in red, which is denoted by GT, an abbreviation of ground truth. The goal is to approximate it via: (black) a rational function computed using our algorithm FRFF-coreset from the Experimental Results (see Section 3), (blue) a rational function computed using Scipy.optimize.minimize, which aims to minimize the same RFF loss as in Equation 1, and (green) a polynomial of degree 3 computed using the numpy.polyfit function, which minimizes the sum of squared distances between the polynomial and the input. For a fair comparison, all three methods were allowed 4 free parameters. (Left): All 3 methods applied to the original signal. (Right): A coreset of size < 10% of the input was first computed for the given signal via Algorithm 1. Then, all 3 methods were applied on the coreset points only. The error bars are from 10 experiments.

Figure 5: A set [β] for β = 30 of indices (black ticks), a set C = {c 1 , c 2 } ⊆ [β] of 2 indices, and a partition R 1 ∪ R 2 ∪ R 3 = [β] \ C of the indices not in C, as described in Line 17 of Algorithm 3.

Figure 7: An exponential partition X 1 i , • • • , X 8 i of a set X i = [30]; see Line 6 of Algorithm 4.

Definition (range space; Feldman et al. (2013)). A range space is a pair (L, ranges) where L is a set, called the ground set, and ranges is a family (set) of subsets of L.

Definition (dimension of range spaces; Feldman et al. (2013)). The dimension of a range space (L, ranges) is the size |S| of the largest subset S ⊆ L such that |{S ∩ range | range ∈ ranges}| = 2^{|S|}.

Definition 7 (range space of functions; Feldman et al. (2013); Har-Peled & Sharir (2009); Feldman & Langberg (2011)). Let F be a finite set of functions from a set

For every c ∈ R^k and any x ∈ R we can compute poly(c, x) with O(k) arithmetic operations on real numbers and jumps conditioned on comparisons of real numbers; see, for example, Horner's scheme Neumaier (2001), which is used in numpy's implementation of the method polyval Harris et al. (2020). Therefore, for every i ∈ [n] and any θ ∈ R^{2k+1}, by the definition of D, we can calculate h_θ(i) with O(k) arithmetic operations on real numbers and jumps conditioned on comparisons of real numbers. Hence, substituting d := n, m := 2k + 1, h := h, H := H, and t ∈ O(k) in Theorem 2 yields that the VC-dimension of H is in O(k^2). Hence, by the construction of H and the definition of range spaces in Definition 7, we have that the dimension of the range space R_{P,F} that is induced by P and F is in O(k^2).
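The O(k) evaluation above can be sketched via Horner's scheme. For this illustration only, we assume coefficients are given in ascending order of degree (note that numpy.polyval expects the opposite, descending order).

```python
def horner(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[k]*x**k with k multiplications and k additions."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# 1 + 2x + 3x^2 at x = 2  ->  1 + 4 + 12 = 17
value = horner([1.0, 2.0, 3.0], 2.0)
```

Each loop iteration performs one multiplication and one addition, so the total cost is O(k), matching the operation count used in the VC-dimension bound above.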

Figure 8: Visual illustration of the main idea behind Claim 1. The n-signal P = {(1, 0), ..., (25, 0)} as in Claim 1 for n = 25 (black dots). Consider a subset C that contains all these points except a single one, say, the red point (15, 0). We can always find a rational function (query) whose sum of distances is close to zero for P, but close to ∞ for C, due to its high change at the point (15, 0) that was not selected to C.

Let k ≥ 1 be an integer, and a, b ∈ R where a < b. Let f : [a, b] → (0, ∞) be a positive and non-decreasing polynomial over [a, b] of degree at most k. Then f is a k-Lipschitz function.

via the following Case (i): g is constant in the range of X. From the definition of the case and the definition of f it follows that f is constant in the range of X, and thus max x∈X |f (x)| = min x∈X |f (x)|.

S_i^{(j)} := MINI-REDUCE(B_i^{(j)}, λ), as computed in Line 5 of the call to Algorithm 3

Lemma 13 for every i ∈ G* and j ∈ [|B_i|] (by the choice of G*, q_G is 2^k-bounded over R_i, which contains the first coordinate of P

6k − 3, then, by the construction of Algorithm 3, the output can be computed in O(|P| · β'). Hence, from now on we assume this is not the case. Consider a single "for" iteration over the values of G ⊂ [β] during the execution of Line 6 in the call to Algorithm 3. By the construction of every S_i^{(j)}, we have |S_i^{(j)}| ≤ λ for every i ∈ [β] and j ∈ [|B_i|]. Hence, since S_G is the union of the sets S_i^{(j)} over every i ∈ G ⊂ [β] and j ∈ [|B_i|], recalling the definition of β' trivially yields |S_G| ∈ O(λβ'). By Lemma 8, Line 11 can be computed in (2k|S_G|)^{O(k)} ∈ (λ · β')^{O(k)} time. The rest of the lines can be computed in |P| · (k + 1)^{O(1)} time; therefore, every iteration can be computed in |P| · (k + 1)^{O(1)} + (λ · β')^{O(k)} time. Assigning ε = 1/10 in the definition of λ in the lemma yields λ ∈ 4^{O(k)} log(n/δ). Since there are $\binom{\beta}{6k-3}$

That is, with probability at least 1 − δ/(4n), B_i is a (4, |B_i|)-approximation to $\cup_{a=1}^r P_a$. Let Q be the union over the pairs (c, c') ∈ (R^k)^2 such that 1 + x · poly(c', x) ≠ 0 for every (x, ·) ∈ P*_i. Hence, with probability at least 1 − δ/(4n), for every q ∈ Q, by assigning P := $\cup_{a=1}^r P_a$, B := B_i, P* := $\cup_{a=1}^r P^*_a$, α = 4 and β = |B_i| in Corollary 5, we obtain

a^T · c − y = 0 for every (a, y) ∈ S; there is such a value by properties of linear regression. Let G be the multi-set that is the union over every S ⊂ P of size |S| = d. Let δ ∈ (0, 1), ε = 2^{−d}, and let λ := max{⌈log_{1−ε}(δ)⌉, 1}. Let S ⊂ G, |S| = λ, where each value is sampled i.i.d. and uniformly. With probability at least 1 − δ there is c ∈ S such that ∑_{(x,y)∈P} |x^T

Lemma 18. Suppose that there is $b = (b_1, b_2, \cdots, b_{2k}) \in \mathbb{R}^{2k}$ such that the following holds: (i) $A \cdot b = \tilde{y}$, i.e., $b$ is a solution to the constructed linear system. (ii) For every $(x, y) \in S$ we have $1 + x \cdot \mathrm{poly}((b_{k+1}, b_{k+2}, \cdots, b_{2k}), x) \ne 0$. Put $c = (b_1, b_2, \cdots, b_k)$ and $c' = (b_{k+1}, b_{k+2}, \cdots, b_{2k})$. Then $D(c, c', p) = 0$ for every $p \in S$.

Proof. Using the notation from the lemma we obtain
$$A \cdot b = \tilde{y} \quad (62)$$
$$A_1 \cdot c + A_2 \cdot c' = \tilde{y} \quad (63)$$
$$\forall (x, y) \in S: \mathrm{poly}(c, x) - y \cdot x \cdot \mathrm{poly}(c', x) = y \quad (64)$$
$$\forall (x, y) \in S: \mathrm{poly}(c, x) = y + y \cdot x \cdot \mathrm{poly}(c', x) \quad (65)$$
$$\forall (x, y) \in S: \mathrm{poly}(c, x) = y \cdot (1 + x \cdot \mathrm{poly}(c', x)) \quad (66)$$
$$\forall (x, y) \in S: \frac{\mathrm{poly}(c, x)}{1 + x \cdot \mathrm{poly}(c', x)} = y, \quad (67)$$
where Equation 63 is by the block structure $A = (A_1 \mid A_2)$, Equation 64 is by the definitions of $A_1$ and $A_2$ in Definition 12, Equations 65 and 66 are by reorganizing the expression, and Equation 67 is by dividing by $1 + x \cdot \mathrm{poly}(c', x)$, which is legal by assumption (ii).

I FULL RESULTS FOR REAL LIFE DATA TESTS

I.1 FULL RESULTS FOR THE TEST OVER THE DATASET CHEN (2019)

Figure 10: Results for the experiment from Section 3.2. The X-axis presents the size of the compression, in percent of the original data, and the Y-axis presents the approximation error of each compression scheme, using Evaluation method (i). The plots correspond (left to right) to the DEWP, PRESS, and TEMP properties in the dataset Chen (2019), and (top to bottom) to the years 2013, 2014, 2015, and 2016. The methods RandomSample and NearConvexCoreset produced very large errors and are clipped in some cases.

Figure 11: Evaluation with Method (ii), similar to Method (i) in Fig. 10.

Figure 12: Visual illustration of the n-signal P corresponding to the Temperature and Absolute Humidity properties of the Dataset Vito (2016), along with three fitted functions: (i) the approximate rational function computed by our algorithm FRFF-coreset, (ii) the output of a call to Scipy.optimize.minimize that aims to fit a rational function of degree 2 to the input signal, and (iii) a polynomial of degree 3, computed using the numpy.polyfit function. For a fair comparison, all three methods use 4 free parameters.

Figure 13: Results for an experiment similar to the experiment from Section 3.1, but with the dataset Vito (2016). The x-axis shows the compression ratio, in percent of the original data. The y-axis shows the approximation error of each compression scheme, for the properties Temperature (left column) and Absolute Humidity (right column). The upper and lower rows present Evaluations (i) and (ii) respectively. The error bars present the 25% and 75% percentiles. The methods RandomSample and NearConvexCoreset produced very large errors and were thus clipped in some cases.

O(1)} time. By Sturm's theorem Thomas (1941), we can validate each interval candidate for a root in (k + 1)^{O(1)} time. Since there are |X| candidates for roots (intervals and integers), all of the roots can be computed to sufficient precision in (k + 1)
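For self-containment, here is a minimal sketch of root counting with Sturm's theorem (our own illustration, not the paper's implementation): build the Sturm chain by repeated negated polynomial remainders, then subtract the numbers of sign changes at the two endpoints. The floating-point tolerances are arbitrary illustrative choices, and the polynomial is assumed square-free.

```python
def polyrem(num, den):
    """Remainder of polynomial division; coefficients in descending order of degree."""
    num = list(num)
    while len(num) >= len(den):
        f = num[0] / den[0]
        for i in range(len(den)):
            num[i] -= f * den[i]
        num.pop(0)  # leading coefficient is now (numerically) zero
    return num

def sturm_chain(p):
    """Sturm chain of a square-free polynomial p (descending coefficients)."""
    n = len(p) - 1
    dp = [c * (n - i) for i, c in enumerate(p[:-1])]  # derivative of p
    chain = [list(p), dp]
    while len(chain[-1]) > 1:
        r = [-c for c in polyrem(chain[-2], chain[-1])]  # negated remainder
        while r and abs(r[0]) < 1e-12:
            r.pop(0)  # strip numerically-zero leading coefficients
        if not r:
            break
        chain.append(r)
    return chain

def evalp(p, x):
    v = 0.0
    for c in p:
        v = v * x + c
    return v

def sign_changes(chain, x):
    signs = [evalp(q, x) for q in chain]
    signs = [s for s in signs if abs(s) > 1e-12]
    return sum(1 for s, t in zip(signs, signs[1:]) if s * t < 0)

def count_roots(p, a, b):
    """Number of distinct real roots of p in the half-open interval (a, b]."""
    chain = sturm_chain(p)
    return sign_changes(chain, a) - sign_changes(chain, b)

# x^3 - x has the three real roots -1, 0, 1.
roots_in_interval = count_roots([1.0, 0.0, -1.0, 0.0], -2.0, 2.0)
```

Here the chain for x^3 - x is (x^3 - x, 3x^2 - 1, (2/3)x, 1), with three sign changes at x = -2 and none at x = 2, so the count is 3.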

Proof. Let $c^* \in \arg\min_{c' \in \mathbb{R}} \sum_{(x,y) \in P} |x \cdot c' - y|$. By Markov's inequality, with probability at least 1/2

table.

table, which contains the times only for the Absolute Humidity feature in the dataset.

Appendix

In the following lemma we will prove the existence of b as defined in the previous lemma.

Lemma 19. (i) The matrix A is invertible. (ii) For every (x, y) ∈ S we have 1 + x · poly(c', x) ≠ 0.

Proof. Proving that A is invertible, i.e., proving property (i): recall the definition of A_1, A_2, and A from Definition 12. Since B_1 and B_2 are Vandermonde matrices for distinct x-values, det(B_1), det(B_2) ≠ 0; the x-values are distinct by the definition of S from Definition 12. Hence, by block matrix properties, for A to be invertible it suffices that det(B_4 − ...) ≠ 0. Let X_1 = diag(x̃_1) and X_2 = diag(x̃_2) be k × k matrices. By the construction of A and the previous definitions we have ... By the definition of S in Definition 12 we have det ... It can be seen that even if there is (x, y) ∈ S such that 1 + x · poly(c', x) = 0, we can add arbitrarily small noise to the y-values (without violating the condition from the previous part), which would tweak c' such that for every (x, y) ∈ S we would have 1 + x · poly(c', x) ≠ 0.

In the following lemma we combine the previous lemmas to obtain the previously mentioned desired properties of SOLVER.

Lemma 20. SOLVER(S) is never empty and can be computed in O(k^3) time.

Proof. Let A and ỹ be as defined in Definition 12. By the proof of Lemma 19 we have det(A) ≠ 0. Also by the proof of Lemma 19, for every (x, y) ∈ S we have 1 + x · poly(c', x) ≠ 0. Therefore, since b satisfies all the conditions in Lemma 18, (c, c') is a candidate output of SOLVER(S).

In the following lemma we combine the previous result with some assumptions to obtain the desired guarantees for Algorithm 6.

Lemma 21. Let P = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^2, where n ≥ 2k, be a set of points with unique first coordinates, none of which equals zero, and where for every S ⊂ P of size 2k we have |{x · y | (x, y) ∈ S}| = 2k. Let δ ∈ (0, 1), ε = 4^{−k}, and let λ := max{⌈log_{1−ε}(δ)⌉, 1}. Let G be the output of a call to FAST-CENTROID-SET(P, λ); see Algorithm 6.
Let (c_1, c'_1) ∈ arg min ... and suppose that there is ... which, by the construction of FAST-CENTROID-SET, happens with probability at least 1 − δ; this follows from assigning P := {(x_1, y_1), ..., (x_n, y_n)} in Lemma 17. Suppose that there is ρ ≥ 1 such that for every (x, y) ∈ P we have ... Then we have that ℓ(P, (c, c')) ≤ αρ · ℓ(P, (c_1, c'_1)).

Proof. We have that ... query considered. Hence, this difference affected the difference between the results for Evaluations (i) and (ii), where for the latter RFF-coreset achieved almost consistently better results, while in the former it lost in many cases. Observe especially the PRES property in Section 3.2, where in Evaluation (i) RFF-coreset had only worse or equivalent results compared to FRFF-coreset, but for Evaluation (ii) RFF-coreset gave significantly better results than FRFF-coreset. We note that for TEMP for the year 2013 in Fig. 11, where we used Evaluation (ii), and in Fig. 3.1 for the Evaluation (ii) of the Temperature property, we had equivalent quality between RFF-coreset and FRFF-coreset, where the latter had better results in some instances.

J.3.1 FIGURES 9 AND 13:

In contrast to the example in Fig. 4, in those examples the polynomial fitting with the same number of parameters yielded slightly better results. This is an unsurprising result that follows intuitively from considering that, as rational functions might yield a better fit for some datasets (especially "non-smooth" ones Peiris et al. (2021)), for other datasets it is possible to have the opposite effect, where polynomial fitting outperforms rational function fitting. We believe that this is a specific example of Wolpert (1996), related to the well-known "No free lunch theorem" Wolpert & Macready (1997). Nonetheless, even in those examples our approximation outperforms the SciPy Virtanen et al.
(2020) approximation via Scipy.optimize.minimize by a significant margin, and gives only slightly worse results than Numpy's Harris et al. (2020) polynomial fitting via numpy.polyfit. Observe that this does not invalidate the real-world data experiments in Section 3, since while polynomial fitting yields a lower loss, we focused on the task of fitting rational functions, which is a valid optimization problem on its own. We also note that our rational function was obtained from FRFF-coreset, and it is plausible that the optimal rational function (with non-constant denominator) would outperform the optimal polynomial fit with an equal number of free parameters.

J.4 DISCUSSION ON THE THEORETICAL RESULT

Another contribution of this work is the theoretical results. Our main result, which is the basis of our tested methods, is Theorem 5, which informally can be summarised as follows: given an n-signal P, we can compress it to a sub-linear size in quasi-linear time, and with the stated probability the compression allows us to compute ℓ(P, q), for every query q ∈ (R^k)^2, up to a multiplicative factor of (1 ± ε).

In our eyes another main contribution of this work is the very uncommon framework used here, whose overview is in Section 2.2. In particular, we used the merge-and-reduce scheme presented in Braverman et al. (2020) to maintain a BI-CRITERIA tree, whereas it is usually presented for maintaining online coresets. We also wish to mention the combination of the BI-CRITERIA approximations, where we essentially "trimmed" the approximation into chunks where the data is "well-behaved"; this allowed us to compute a "leaky coreset" that fails for a bounded number of BI-CRITERIA for each query. This "leaky coreset" is used in union with an exhaustive search over the "leaks", i.e., the BI-CRITERIA in the input where the coreset fails.

While those methods might be useful only to researchers in the coreset community, we hope that these novel methods might help build coresets for problems where there was previously a coreset only for a very specific case. From a bottom-up view this can be seen as the focus of this work: we start with a coreset only for a very restricted case (both in query and form) and build upon it to obtain an approximation, and consecutively an ε-coreset, without any such limitations.
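The merge-and-reduce scheme mentioned above can be sketched as a binary-counter tree over the stream. This is a generic illustration under our own assumptions: `halve` (keeping every other point) is a placeholder stand-in for the paper's REDUCE, and the leaf size is an arbitrary choice.

```python
def halve(points):
    """Placeholder REDUCE: keep every other element.
    A real coreset construction would be plugged in here instead."""
    return points[::2]

class MergeReduce:
    """Binary-counter merge-and-reduce tree over a stream of points."""

    def __init__(self, leaf_size, reduce_fn):
        self.leaf_size = leaf_size
        self.reduce_fn = reduce_fn
        self.buffer = []   # raw, not-yet-reduced points
        self.levels = []   # levels[i]: reduced bucket covering 2**i leaves, or None

    def add(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.leaf_size:
            bucket = self.reduce_fn(self.buffer)
            self.buffer = []
            # Carry propagation, exactly like incrementing a binary counter:
            # merge equal-level buckets upward until an empty level is found.
            i = 0
            while i < len(self.levels) and self.levels[i] is not None:
                bucket = self.reduce_fn(self.levels[i] + bucket)
                self.levels[i] = None
                i += 1
            if i == len(self.levels):
                self.levels.append(bucket)
            else:
                self.levels[i] = bucket

    def coreset(self):
        out = list(self.buffer)
        for b in self.levels:
            if b is not None:
                out.extend(b)
        return out

stream = list(range(100))
mr = MergeReduce(leaf_size=8, reduce_fn=halve)
for p in stream:
    mr.add(p)
compressed = mr.coreset()
```

At any moment at most one reduced bucket is stored per level, so the memory is O(leaf_size · log n); each stream element participates in O(log n) reduce calls overall.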

