EFFICIENT COVARIANCE ESTIMATION FOR SPARSIFIED FUNCTIONAL DATA

Abstract

To avoid the prohibitive computational cost of transmitting entire data vectors, we propose four sparsification schemes, RANDOM-KNOTS, RANDOM-KNOTS-SPATIAL, B-SPLINE, and BSPLINE-SPATIAL, and present the corresponding nonparametric estimators of the covariance function. The covariance estimators are asymptotically equivalent to the sample covariance computed directly from the original data, and under regularity conditions the estimated functional principal components effectively approximate the infeasible principal components. The convergence rates show that leveraging spatial correlation and B-spline interpolation reduces information loss. A data-driven selection method is further applied to determine the number of eigenfunctions in the model. Extensive numerical experiments are conducted to illustrate the theoretical results.

1. INTRODUCTION

Dimension reduction has received increasing attention as a way to avoid expensive and slow computation. Stich et al. (2018) investigated the convergence rate of stochastic gradient descent after sparsification. Jhunjhunwala et al. (2021) focused on estimating the mean of a vector from a subset of its entries. The goal of this paper is to estimate the covariance function of sparsified functional data, that is, a set of sparsified vectors collected from a distributed system of nodes. Functional data analysis (FDA) has become an important research area due to its wide applications. Classical FDA requires a large number of regularly spaced measurements per subject. The data take the form {(x_ij, j/d), 1 ≤ i ≤ n, 1 ≤ j ≤ d}, in which x_ij = x_i(j/d) for a latent smooth trajectory

x_i(·) = m(·) + Z_i(·).   (1)

The deterministic function m(·) denotes the common population mean, and the random Z_i(·) are subject-specific small variations with EZ_i(·) = 0. Both m(·) and Z_i(·) are smooth functions of time t = j/d, which is rescaled to the domain D = [0, 1]. The trajectories x_i(·) are identically distributed realizations of a continuous stochastic process {x(t), t ∈ D} with E sup_{t∈D} |x(t)|² < +∞, which can be decomposed as x(·) = m(·) + Z(·), EZ(t) = 0. The true covariance function is G(t, t′) = Cov{Z(t), Z(t′)}. Let the sequences {λ_k}_{k=1}^∞ and {ψ_k}_{k=1}^∞ be the eigenvalues and eigenfunctions of G(t, t′), respectively, with λ_1 ≥ λ_2 ≥ ··· ≥ 0 and Σ_{k=1}^∞ λ_k < ∞, where {ψ_k}_{k=1}^∞ form an orthonormal basis of L²[0, 1]; see Hsing & Eubank (2015). The Mercer Lemma entails that the ψ_k's are continuous and

G(t, t′) = Σ_{k=1}^∞ λ_k ψ_k(t) ψ_k(t′),   ∫ G(t, t′) ψ_k(t′) dt′ = λ_k ψ_k(t).
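To fix ideas, the decomposition above can be simulated directly. The sketch below (our own illustration, borrowing the mean and eigenfunctions later specified in Section 3) draws trajectories from model (1) via a truncated Karhunen-Loève expansion and forms the full-data sample covariance Ḡ; the truncation level K is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 100, 6          # nodes, grid size, truncation level (illustrative)
t = np.arange(1, d + 1) / d    # rescaled time j/d on D = [0, 1]

m = np.sin(2 * np.pi * (t - 0.5))              # mean function from Section 3
lam = (1 / 4) ** (np.arange(1, K + 1) // 2)    # eigenvalues lambda_k = (1/4)^[k/2]
psi = np.empty((K, d))
for j in range(1, K + 1):
    if j % 2:   # psi_{2k-1}(t) = sqrt(2) cos(2 k pi t)
        psi[j - 1] = np.sqrt(2) * np.cos(2 * ((j + 1) // 2) * np.pi * t)
    else:       # psi_{2k}(t) = sqrt(2) sin(2 k pi t)
        psi[j - 1] = np.sqrt(2) * np.sin(2 * (j // 2) * np.pi * t)

xi = rng.standard_normal((n, K))               # FPC scores xi_{ik}
Z = (xi * np.sqrt(lam)) @ psi                  # Z_i(t) = sum_k xi_ik sqrt(lam_k) psi_k(t)
X = m + Z                                      # x_i = m + Z_i, model (1)

G_true = (psi.T * lam) @ psi                   # G(t,t') = sum_k lam_k psi_k(t) psi_k(t')
G_bar = np.cov(X, rowvar=False, bias=True)     # sample covariance from full data
```

With n = 200 curves the entrywise deviation between Ḡ and the true kernel G is already of order n^{-1/2}.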

1.1. MAIN CONTRIBUTION

In FDA, covariance estimation plays a critical role in FPC analysis (Ramsay & Silverman (2005), Li & Hsing (2010)), functional generalized linear models, and other nonlinear models (Yao et al. (2005b)). We propose four sparsification schemes. RANDOM-KNOTS and RANDOM-KNOTS-SPATIAL can be classified as RANDOM-SPARSIFICATION, where knots are selected uniformly at random from the entire set of points. B-SPLINE and BSPLINE-SPATIAL are called FIXED-SPARSIFICATION, retaining knots at fixed positions in each dimension of the vector. For all sparsification modes, we construct a two-step covariance estimator Ĝ(·, ·), where the first step involves the sparsified trajectories and the second step plugs the estimated trajectories into the sample covariance in place of the latent true trajectories. The statistic is further multiplied by an appropriate constant to ensure unbiasedness. The covariance estimator Ĝ(·, ·) enjoys good properties and effectively approximates the sample covariance function Ḡ(·, ·) computed directly from the original data. This paper improves the performance of the statistics in the following two respects, requiring little or no side information and little additional local computation.

• SPATIAL CORRELATION: We replace the fixed weights assigned to different vectors by data-driven parameters that represent the amount of spatial correlation. This family of statistics naturally takes the influence of correlation among nodes into account, which can be viewed as a spatial factor. Theoretical derivation reveals that the estimation error can be drastically reduced when spatial factors among subjects are considered.

• B-SPLINE INTERPOLATION: To fill in the gaps between equispaced knots, we introduce a spline interpolation method to characterize the temporally ordered trajectories of the functional data. The estimated trajectory obtained by the B-spline smoothing method is as efficient as the original trajectory.
The proposed covariance estimator has a globally consistent convergence rate, enjoying better theoretical properties than its counterpart without interpolation. In contrast to covariance estimation based on tensor-product B-splines, our two-step estimators are guaranteed to be positive semi-definite. In sum, the main advantage of our methods is their computational efficiency and feasibility for large-scale dense functional data. This is practically relevant since curves or images measured with new technology are usually of much higher resolution than those of the previous generation, which directly multiplies the amount of data recorded at each node; this motivates sparsification before feature extraction, modeling, or other downstream steps. The paper is organized as follows. Section 2 introduces four sparsification schemes and the corresponding unbiased covariance estimators. We also derive the convergence rates of the covariance estimators and the related FPCs. Simulation studies are presented in Section 3 and an application to domain clustering in Section 4. All technical proofs are collected in the Appendix. Related sparsification approaches include Sahu et al. (2021). Sparsification methods mainly focus on sending only a subset of the elements of the vectors, yet no existing method combines sparsification with B-spline fitting. Moreover, there has been striking progress on sparse PCA. Berthet & Rigollet (2013b) and Choo & d'Orsi (2021) analyzed the complexity of sparse PCA; Berthet & Rigollet (2013a) and Deshpande & Montanari (2014) obtained sparse principal components for particular data models. Since our estimation methods are new, the related PCA analysis is developed here for the first time.
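The two families of schemes can be sketched in a few lines of numpy. The function names and the equispaced choice of fixed positions below are our own illustration; the paper's fixed positions are the interior knots of Section 2.2.

```python
import numpy as np

def random_sparsify(X, Js, rng):
    """RANDOM-SPARSIFICATION: keep each coordinate independently w.p. Js/d, zero the rest."""
    d = X.shape[1]
    mask = rng.random(X.shape) < Js / d
    return np.where(mask, X, 0.0), mask

def fixed_sparsify(X, Js):
    """FIXED-SPARSIFICATION: keep Js fixed (here equispaced) coordinates, zero the rest."""
    d = X.shape[1]
    keep = np.linspace(0, d - 1, Js).round().astype(int)
    H = np.zeros_like(X)
    H[:, keep] = X[:, keep]
    return H, keep
```

Both functions return the sparsified vectors h_i together with the retained-position information used by the estimators of Section 2.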

2. MAIN RESULTS

We consider n geographically distributed nodes; each node i ∈ {1, 2, ..., n} generates a d-dimensional vector x_i = (x_i1, ..., x_id)^⊤. The mean and covariance functions can be estimated by m̄(t) = n^{-1} Σ_{i=1}^n x_i(t) and Ḡ(t, t′) = n^{-1} Σ_{i=1}^n (x_i(t) - m̄(t))(x_i(t′) - m̄(t′)); see Brockwell & Davis (2009). For q ∈ N and µ ∈ (0, 1], write H^(q,µ)[0, 1] for the space of µ-Hölder continuous functions, i.e.,

H^(q,µ)[0, 1] = { φ : [0, 1] → R | ∥φ∥_{q,µ} = sup_{x,y∈[0,1], x≠y} |φ^(q)(x) - φ^(q)(y)| / |x - y|^µ < +∞ }.

For any L²-integrable functions ϕ(x) and φ(x), x ∈ D, take ⟨ϕ, φ⟩ = ∫_D ϕ(x)φ(x)dx, with ∥ϕ∥²₂ = ⟨ϕ, ϕ⟩. For simplicity, ∥ϕ∥ = ∥ϕ∥₂. We next introduce some technical assumptions.

Assumption 1: There exist an integer q > 0 and a constant µ ∈ (0, 1] such that the regression function m(·) ∈ H^(q,µ)[0, 1]. In the following, we denote p* = q + µ for simplicity.

Assumption 2: The covariance function satisfies sup_{(t,t′)∈[0,1]²} G(t, t′) < C for some positive constant C, and min_{t∈[0,1]} G(t, t) > 0.

Assumption 3: There exists a constant θ > 0 such that, as d → ∞, n = n(d) → ∞ with n = O(d^θ).

Assumption 4: The rescaled FPCs ϕ_k(·) ∈ H^(q,µ)[0, 1] with Σ_{k=1}^∞ (∥ϕ_k∥_{q,µ} + ∥ϕ_k∥_∞) < +∞; for increasing positive integers {k_n}_{n=1}^∞, as n → ∞, Σ_{k=k_n+1}^∞ ∥ϕ_k∥_∞ = O(n^{-1/2}) and k_n = O(n^ω) for some ω > 0.

Assumption 5: The FPC scores {ξ_ik}_{i≥1,k≥1} are independent over k ≥ 1. The number of distinct distributions for all FPC scores {ξ_ik}_{i≥1,k≥1} is finite, and max_{1≤k<∞} Eξ_{1k}^{r₀} < ∞ for some r₀ > 4.

Assumption 6: The number of knots satisfies J_s ≍ d^γ C_d for some τ > 0 with C_d + C_d^{-1} = O(log^τ d) as d → ∞, where γ ≥ 1 - θ/2 for RANDOM-SPARSIFICATION and γ > θ/(2p*) + 2θ/(r₀ p*) for FIXED-SPARSIFICATION.

Assumptions 1-5 are standard requirements for obtaining mean and covariance estimators in the literature. Assumption 1 guarantees the orders of the bias terms of the spline smoothers for m(·). Assumption 2 ensures the covariance G(·, ·) is uniformly bounded.
Assumption 3 implies that the dimension d of the vectors diverges to infinity as n → ∞, a well-developed asymptotic scenario for dense functional data; all of the following asymptotics are developed as both the number of nodes n and the dimension d tend to infinity. Assumption 4 concerns the bounded smoothness of the FPCs, and Assumption 5 ensures bounded FPC scores; both are used to bound the bias terms in the spline covariance estimator. The smoothness of our estimator is controlled by the number of knots, which is specified in Assumption 6.

Remark 1. These assumptions are mild conditions that can be satisfied in many practical situations. One simple and reasonable setup is q + µ = p* = 4, θ = 1, γ = 5/8, C_d ≍ log log d. We set p = 4 for FIXED-SPARSIFICATION to obtain cubic spline estimation and p = 0 for RANDOM-SPARSIFICATION, i.e., estimates without interpolation. These constants are used as defaults in the numerical studies.
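Under the Remark 1 defaults, the implied number of knots can be computed as follows. The rounding rule is our own choice; the theory only pins down the order J_s ≍ d^γ C_d.

```python
import numpy as np

def default_knots(d, gamma=5 / 8):
    """Number of knots J_s of order d^gamma * C_d with C_d = log log d
    (Remark 1 defaults); the rounding to an integer is our convention."""
    return max(1, int(round(d ** gamma * np.log(np.log(d)))))
```

For example, the knot count grows slowly with the resolution d, so the sparsification ratio J_s/d shrinks as curves become denser.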

2.1. RANDOM-SPARSIFICATION

Elements in each vector are randomly set to zero with probability 1 - J_s/d. Under this sparsification scheme, the RANDOM-KNOTS and RANDOM-KNOTS-SPATIAL covariance estimators are proposed, in which the temporal dependence exhibited in functional samples is ignored. The proportion J_s/d describes the reduction in data volume after sparsification, reflecting the degree of sparsification.

RANDOM-KNOTS (RK) As shown in Figure 2.2 (a), the mean and covariance estimators are built from the sparsified vectors h_i = (h_i1, ..., h_id)^⊤, i.e.,

m̂ = (1/n)(d/J_s) Σ_{i=1}^n h_i   and   Ĝ = (1/n)(d/J_s)² Σ_{i=1}^n (h_i - h̄)(h_i - h̄)^⊤,

where h̄ = (1/n) Σ_{i=1}^n h_i is a d-dimensional vector. The next theorem states that the mean squared error (MSE) of the estimator tends to zero as n → ∞.

Theorem 1. (RK Estimation Error) Under Assumptions 1-5, the MSE of the estimate Ĝ produced by the RK sparsification scheme described above is E∥Ĝ - Ḡ∥² = n^{-2} ((d/J_s)² - 1) R₁, where R₁ = Σ_{i=1}^n ∥x_i - m∥⁴. Assumption 6 further guarantees that ∥Ĝ - Ḡ∥ = O_p(n^{-1/2}).

RANDOM-KNOTS-SPATIAL (RK-SPAT) Denote by M_j the number of nodes that send their j-th coordinate, which describes the correlation between nodes. Clearly M_j is a binomial random variable taking values in {0, 1, ..., n} with P(M_j = m) = C(n, m) p^m (1 - p)^{n-m}, p = J_s/d. If M_j = 0, none of the nodes have drawn the j-th element, and the information at position j is completely missing. If nodes are highly correlated, the estimator is accurate at position j even if few points at that position are selected. Consider the special case where the vectors of all nodes are identical, i.e., x_1 = x_2 = ... = x_n. The j-th coordinate of the mean can be estimated exactly as m̂_j = (1/M_j) Σ_{i=1}^n h_ij whenever M_j > 0, and the exact covariance is Ĝ_{jj′} = (1/M_j) Σ_{i=1}^n (h_ij - h̄_j)(h_ij′ - h̄_j′), where h̄_j is the j-th element of the d-dimensional vector h̄, j′ ≠ j. A simple derivation shows that h̄_j = m̂_j in this situation; hence the fixed scaling parameter J_s/d is not necessary. If nodes are weakly correlated, a small M_j may lead to a large MSE. Consider (i) the vectors of n - 1 nodes follow the sine curve x_ij = sin(2πj/d), 1 ≤ i ≤ n - 1, 1 ≤ j ≤ d, while the outlier node follows the cosine curve x_nj = cos(2πj/d); or (ii) the outlier vector has a jump at the j-th position, x_nj = sin(2πj/d) + δ, x_nj′ = sin(2πj′/d), δ > 0, for j′ ∈ {1, ..., j - 1, j + 1, ..., d}, while the other n - 1 nodes follow the standard sine curve.
In the special case that only the outlier vector is selected at position j, the estimate at this position is bound to have a large deviation. Therefore, the correlation between nodes is an important indicator of the accuracy of the estimators. We propose the RK-SPAT estimator, in which the fixed scaling parameter J_s/d is replaced by a function of M_j so that the spatial correlation between nodes is taken into account. Specifically, the mean estimator for the j-th element is m̂^SPAT_j = (1/n) (β̂/T(M_j)) Σ_{i=1}^n h_ij, and the covariance function at position (j, j′) is

Ĝ^SPAT_{jj′} = (1/n) (β̂² / (T(M_j) T(M_j′))) Σ_{i=1}^n (h_ij - h̄_j)(h_ij′ - h̄_j′),   (2)

where the introduced function T(M_j) replaces the fixed scaling parameter and β̂ is defined by

β̂^{-1} = (J_s/d) E_{M_j | M_j ≥ 1}[1/T(M_j)] = Σ_{r=1}^n (J_s/(d T(r))) C(n-1, r-1) (J_s/d)^{r-1} (1 - J_s/d)^{n-r}.   (3)

We prove that the RK-SPAT covariance estimator is unbiased.

Proposition 1. (RK-SPAT Unbiasedness) E[Ĝ^SPAT] = Ḡ.

The following theorem measures the approximation quality of the RK-SPAT estimator.

Theorem 2. (RK-SPAT Estimation Error) Under Assumptions 1-5, the MSE of the estimate produced by the RK-SPAT family is

E∥Ĝ^SPAT - Ḡ∥² = n^{-2} ((d/J_s + c₁)² - 1) R₁ + n^{-2} ((1 - c₂)² - 1) R₂,

where R₁ = Σ_{i=1}^n ∥x_i - m∥⁴, R₂ = 2 Σ_{i=1}^n Σ_{k=i+1}^n ⟨x_i - m, x_k - m⟩², and β̂ is defined in (3). The parameters c₁ and c₂ depend on the choice of T(·) through

c₁ = β̂² Σ_{r=1}^n (J_s/(d T(r)²)) C(n-1, r-1) (J_s/d)^{r-1} (1 - J_s/d)^{n-r} - d/J_s,
c₂ = 1 - β̂² Σ_{r=2}^n (J_s²/(d² T(r)²)) C(n-2, r-2) (J_s/d)^{r-2} (1 - J_s/d)^{n-r}.

Assumption 6 further guarantees that ∥Ĝ^SPAT - Ḡ∥ = O_p(n^{-1/2}). Theorem 2 can be rewritten as E∥Ĝ^SPAT - Ḡ∥² = n^{-2} ((d/J_s)² + c₁² + 2c₁ d/J_s - 1) R₁ + n^{-2} (c₂² - 2c₂) R₂. The MSE of the RK-SPAT covariance estimator is therefore lower than that of the RK estimator whenever (c₁² + 2c₁ d/J_s) R₁ < (2c₂ - c₂²) R₂, i.e., R₂/R₁ > (c₁² + 2c₁ d/J_s) / (2c₂ - c₂²).
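A minimal numpy sketch of the RK covariance estimator Ĝ (the function name is ours; h_i are the sparsified vectors with zeros at dropped coordinates):

```python
import numpy as np

def rk_covariance(H, Js, d):
    """RANDOM-KNOTS estimator: Ghat = (1/n) (d/Js)^2 sum_i (h_i - hbar)(h_i - hbar)^T."""
    n = H.shape[0]
    h_bar = H.mean(axis=0)                 # d-dimensional vector hbar
    C = (H - h_bar).T @ (H - h_bar) / n
    return (d / Js) ** 2 * C               # (d/Js)^2 rescales for Bernoulli(Js/d) masking
```

When J_s = d (no sparsification) the estimator reduces to the ordinary sample covariance Ḡ.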
In general, since the MSE depends on the function T(·) through c₁ and c₂, we can choose T(·) to ensure that the RK-SPAT estimate is more accurate than the RK estimate.

Theorem 3. (RK-SPAT minimum MSE) The optimal RK-SPAT estimator that minimizes the MSE in Theorem 2 is obtained by setting T*(r) = (1 + (R₂/R₁)((r-1)/(n-1))²)^{1/2} for r ∈ {1, 2, ..., n}.

Jhunjhunwala et al. (2021) claimed that the optimal RK-SPAT mean estimator is obtained with T*(r) = 1 + (R₂/R₁)(r-1)/(n-1), which is well suited to tasks involving only the mean function, such as K-means, but cannot guarantee optimal estimation of the covariance function. Setting T*(·) as in Theorem 3 yields an accurate covariance estimate; the features extracted from the eigenequation then improve the efficiency of downstream tasks such as PCA. Meanwhile, the number of nodes n and the dimension d determine the amount of computation needed for R₁ and R₂. We propose the RK-SPAT (AVG) choice T(r) = (1 + (n/2)((r-1)/(n-1))²)^{1/2} as a default setting to avoid the costly computation of R₁ and R₂.
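The RK-SPAT estimator of (2)-(3), with the AVG default for T(·), can be sketched as follows (the function name is ours; coordinates with M_j = 0 carry no information and are left at zero):

```python
import numpy as np
from math import comb

def rk_spat_covariance(H, mask, Js, d, T=None):
    """RK-SPAT covariance estimate, eq. (2); T defaults to the AVG choice
    T(r) = sqrt(1 + (n/2) ((r-1)/(n-1))^2)."""
    n = H.shape[0]
    p = Js / d
    if T is None:
        T = lambda r: np.sqrt(1.0 + (n / 2.0) * ((r - 1) / (n - 1)) ** 2)
    # beta^{-1} = sum_r (Js/(d T(r))) C(n-1, r-1) p^{r-1} (1-p)^{n-r}   -- eq. (3)
    inv_beta = sum(p / T(r) * comb(n - 1, r - 1) * p ** (r - 1) * (1 - p) ** (n - r)
                   for r in range(1, n + 1))
    beta = 1.0 / inv_beta
    M = mask.sum(axis=0)                       # M_j: nodes that sent coordinate j
    Tm = np.array([T(max(m, 1)) for m in M])   # T(M_j); all-zero coords unaffected
    h_bar = H.mean(axis=0)
    C = (H - h_bar).T @ (H - h_bar) / n
    return (beta ** 2 / np.outer(Tm, Tm)) * C
```

Passing the Theorem 3 choice of T (which requires R₁ and R₂) instead of the AVG default recovers the minimum-MSE estimator.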

2.2. FIXED-SPARSIFICATION

We retain elements at J_s fixed positions and set the rest to zero. This dimensionality reduction method only uses values at fixed positions in the vector and has several disadvantages. (i) Each step only leverages a subset of the data of size n × J_s, while the size of the original data set is n × d; since J_s ≪ d, this results in a serious loss of information. (ii) The approximation quality of the estimator depends on the selected points and is difficult to control if the selected knots deviate from the overall distribution. (iii) Even if suitable knots are determined by adding penalty terms or by manual selection, these knots may not be suitable for another node.

B-SPLINE (BS) As shown in Figure 2.2 (b), B-spline interpolation reduces the loss of information by fitting the values between fixed knots. It is worth noting that the choice of basis functions and other smoothing methods (polynomial, kernel, and wavelet smoothing) does not affect the large-sample theory. We choose standard B-spline bases because they are more computationally efficient and numerically stable in finite samples than other bases such as the truncated power series and trigonometric series. B-spline estimation is suitable for analyzing large data sets without a uniform design; see Schumaker (2007).

Denote by {t_ℓ}_{ℓ=1}^{J_s} a sequence of equally spaced points, t_ℓ = ℓ/(J_s + 1), 0 < t_1 < ··· < t_{J_s} < 1, called interior knots, which divide the interval [0, 1] into (J_s + 1) equal subintervals I_0 = [0, t_1), I_ℓ = [t_ℓ, t_{ℓ+1}), ℓ ∈ {1, ..., J_s - 1}, I_{J_s} = [t_{J_s}, 1]. Let t_{1-p} = ··· = t_0 = 0 and 1 = t_{J_s+1} = ··· = t_{J_s+p} be auxiliary knots, and let S^(p-2) = S^(p-2)[0, 1] be the polynomial spline space of order p on the I_ℓ, ℓ ∈ {0, ..., J_s}, which consists of all (p - 2) times continuously differentiable functions on [0, 1] that are polynomials of degree (p - 1) on each subinterval I_ℓ. Denote by {B_{ℓ,p}(t), 1 ≤ ℓ ≤ J_s + p} the p-th order B-spline basis functions of S^(p-2), hence S^(p-2) = { Σ_{ℓ=1}^{J_s+p} λ_{ℓ,p} B_{ℓ,p}(t) | λ_{ℓ,p} ∈ R, t ∈ [0, 1] }. The unknown trajectory x_i(·) is estimated by

h_i(·) = argmin_{g(·) ∈ S^(p-2)} Σ_{j=1}^d {x_ij - g(j/d)}² = Σ_{ℓ=1}^{J_s+p} λ̂_{ℓ,p,i} B_{ℓ,p}(·).

Theorem 4. (BS and BS-SPAT Estimation Error) Under Assumptions 1-6, as n → ∞, (i) ∥m̂ - m̄∥_∞ = O_{a.s.}(n^{-1/2}) and ∥Ĝ - Ḡ∥_∞ = O_p(n^{-1/2}); (ii) ∥m̂^SPAT - m̄∥_∞ = O_p(n^{-1/2}) and ∥Ĝ^SPAT - Ḡ∥_∞ = O_p(n^{-1/2}).

Remark 2. The convergence rates in Theorems 1, 2 and 4 show that the BS (BS-SPAT) covariance estimator converges to Ḡ faster than the RK (RK-SPAT) estimator, and the result holds uniformly on the interval D. The estimation performance of the proposed estimator is thus significantly improved by applying B-spline interpolation. This conclusion is also confirmed in the simulations.
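For the BS scheme, the retained coordinates can be smoothed with a least-squares B-spline. The sketch below uses scipy's LSQUnivariateSpline with equispaced interior knots as a stand-in for the least-squares spline estimator h_i(·) above; the grid sizes and the number of interior knots are illustrative.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def bspline_reconstruct(t_kept, x_kept, t_full, n_interior=8, k=3):
    """Fit a least-squares B-spline of degree k (cubic by default, i.e. order p = 4)
    with equispaced interior knots t_l = l/(n_interior + 1) to the retained
    coordinates, then evaluate it on the full grid."""
    knots = np.linspace(0, 1, n_interior + 2)[1:-1]   # interior knots only
    spl = LSQUnivariateSpline(t_kept, x_kept, knots, k=k)
    return spl(t_full)
```

Fitting a smooth curve through the J_s retained values and reading it back on all d grid points is what closes the information gap between the fixed knots.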

2.3. CONVERGENCE OF PRINCIPAL COMPONENT

The estimates of the eigenfunctions and eigenvalues are obtained by solving the eigenequations ∫ Ĝ(x, x′) ψ̂_k(x′) dx′ = λ̂_k ψ̂_k(x), and their consistency follows.

Theorem 5. As n → ∞, for k ∈ N, we have (i) (Convergence rate of eigenfunctions) ∥ψ̂_k - ψ_k∥ = O_p(n^{-1/2}); (ii) (Convergence rate of eigenvalues) |λ̂_k - λ_k| = O_p(n^{-1/2}); (iii) (Convergence rate of FPC scores) max_{1≤i≤n} |ξ̂_ik - ξ_ik| = O_p(n^{-1/2}).

It is worth noting that the orthonormal basis of the eigenmanifold corresponding to {λ_k}_{k=1}^κ may only be recovered up to rotation; see Dauxois et al. (1982). Therefore, a unique form of the eigenfunction should be determined by minimizing the estimation error through the loss function L(φ̂_k, ϕ_k) = (1/2) min_{s∈{+1,-1}} ∥φ̂_k - sϕ_k∥² = 1 - |⟨φ̂_k, ϕ_k⟩| for φ̂_k, ϕ_k ∈ {v ∈ R^κ : ∥v∥ = 1}. This is required because the estimated principal components {φ̂_k}_{k=1}^κ are only identifiable up to a sign. Analogous results can be obtained for alternative loss functions such as the projection distance L_p(φ̂_k, ϕ_k) = (1/√2) ∥φ̂_k φ̂_k^⊤ - ϕ_k ϕ_k^⊤∥₂ = (1 - ⟨φ̂_k, ϕ_k⟩²)^{1/2}.
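Numerically, the eigenequation is solved by discretizing the kernel on the grid {j/d} and calling a symmetric eigensolver. The helper below (our own sketch) also implements the sign-aligned loss L(φ̂_k, ϕ_k) = 1 - |⟨φ̂_k, ϕ_k⟩|.

```python
import numpy as np

def fpc_from_covariance(G_hat, t):
    """Eigenpairs of the discretized eigenequation  int G(t,t') psi_k(t') dt' = lam_k psi_k(t).
    The integral is approximated by a Riemann sum with grid spacing 1/d."""
    d = len(t)
    delta = 1.0 / d
    vals, vecs = np.linalg.eigh(G_hat * delta)      # discretized kernel operator
    order = np.argsort(vals)[::-1]                  # eigh returns ascending order
    lam = vals[order]
    psi = vecs[:, order].T / np.sqrt(delta)         # rescale so ||psi_k|| = 1 in L2[0,1]
    return lam, psi

def sign_aligned_loss(psi_hat, psi):
    """L = 1 - |<psi_hat, psi>| for unit vectors (identifiable only up to sign)."""
    return 1.0 - abs(np.dot(psi_hat, psi)) / (np.linalg.norm(psi_hat) * np.linalg.norm(psi))
```

The 1/√delta rescaling converts unit Euclidean eigenvectors into functions with unit L²[0, 1] norm, matching the normalization of the ψ_k.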

3. SIMULATION

We conduct simulation studies to illustrate the finite-sample performance of the proposed methods.

3.1. KNOTS SELECTION

The number of knots is treated as an unknown tuning parameter, and the fitting results can be sensitive to it. Since in-sample fitting errors cannot gauge the prediction accuracy of the fitted function, we select a criterion function that attempts to measure the out-of-sample performance of the fitted model.
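As one concrete instance of such a criterion, an AIC-type score can be computed over a grid of candidate knot numbers (the conclusion mentions AIC-based selection, but the paper's exact criterion may differ, so treat the penalty below as an assumption):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def select_knots(t, x, candidates=(4, 8, 16, 32), k=3):
    """Pick the number of interior knots by an AIC-type criterion:
    score = d * log(RSS/d) + 2 * (number of spline coefficients)."""
    best, best_score = None, np.inf
    d = len(t)
    for J in candidates:
        knots = np.linspace(0, 1, J + 2)[1:-1]            # J equispaced interior knots
        spl = LSQUnivariateSpline(t, x, knots, k=k)
        rss = float(np.sum((x - spl(t)) ** 2))
        score = d * np.log(rss / d + 1e-12) + 2 * (J + k + 1)
        if score < best_score:
            best, best_score = J, score
    return best
```

The penalty term 2(J + k + 1) counts the spline coefficients, so larger knot numbers must earn their keep through a genuinely smaller residual sum of squares.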

3.2. ACCURACY OF COVARIANCE ESTIMATOR

Data are generated from the model x_ij = m(j/d) + Σ_{k=1}^∞ ξ_ik ϕ_k(j/d), 1 ≤ j ≤ d, 1 ≤ i ≤ n, where m(t) = sin{2π(t - 1/2)}, ϕ_k(t) = √λ_k ψ_k(t), λ_k = (1/4)^[k/2], ψ_{2k-1}(t) = √2 cos(2kπt), ψ_{2k}(t) = √2 sin(2kπt), and the {ξ_ik} follow the standard normal distribution. d (or n) varies over equally spaced values between 50 and 400 with n = 200 (or d = 200) fixed. Each simulation is repeated 1000 times. The infinite series G(t, t′) = Σ_{k=1}^∞ ϕ_k(t)ϕ_k(t′) is well approximated by the finite sum G(t, t′) = Σ_{k=1}^{1000} ϕ_k(t)ϕ_k(t′) according to the fraction of variance explained (FVE) criterion, FVE = Σ_{k=1}^{1000} λ_k / Σ_{k=1}^∞ λ_k > 1 - 10^{-10}; see Yao et al. (2005a). Denote by Ĝ^s_{j,j′} (Ḡ^s_{j,j′}) the s-th replication of the covariance Ĝ (Ḡ) at position (j, j′), and by G the true covariance function. The average mean squared error (AMSE) is computed to assess the performance of the covariance estimators Ĝ(·, ·) and Ḡ(·, ·):

AMSE(Ĝ) = (1/(1000 d²)) Σ_{s=1}^{1000} Σ_{j,j′=1}^d (Ĝ^s_{j,j′} - Ḡ^s_{j,j′})²,   AMSE(Ḡ) = (1/(1000 d²)) Σ_{s=1}^{1000} Σ_{j,j′=1}^d (Ḡ^s_{j,j′} - G_{j,j′})².

Figure 1 shows that AMSE(Ĝ) decreases as n increases, consistent with Theorems 1 and 2. AMSE(Ĝ) shows a slow downward trend as d increases, mainly because the number of knots changes with d, affecting the performance of the covariance estimator. By setting T(·) as in Theorem 3, the AMSE of the estimator that takes the spatial factor into account is generally lower than that of the estimator that does not. The AMSE of the BS (BS-SPAT) covariance estimator confirms Theorem 4, and the estimation accuracy is significantly improved by spline interpolation. Results on AMSE(Ḡ) are consistent with the fact that Ḡ converges to G at the rate O_p(n^{-1/2}). A visualization of the covariance functions is given in Figure H in the Appendix.
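The AMSE criterion is a plain average of squared entrywise errors over replications; a small helper (ours) makes the definition concrete:

```python
import numpy as np

def amse(G_hats, G_refs):
    """AMSE = (1/(S d^2)) sum_s sum_{j,j'} (Ghat_s[j,j'] - Gref_s[j,j'])^2.
    Pass the per-replicate Gbar_s as references for AMSE(Ghat), or the true G
    repeated S times for AMSE(Gbar)."""
    S = len(G_hats)
    d = G_hats[0].shape[0]
    return sum(float(np.sum((Gh - Gr) ** 2)) for Gh, Gr in zip(G_hats, G_refs)) / (S * d * d)
```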

3.3. ACCURACY OF PRINCIPAL COMPONENTS

The spectral decomposition is truncated at κ = 5 according to the standard "pseudo-AIC" criterion; see Mu et al. (2008). The selected eigenvalues explain over 95% of the total variation.

4. APPLICATION TO DOMAIN CLUSTERING

We cluster vector representations of documents, obtained from word2vec, MLM-based, and auto-regressive language models, through a Gaussian Mixture Model (GMM-k), where k is the number of predetermined clusters. The estimated FPC scores are naturally independent random variables and converge to the infeasible true FPC scores at the rate n^{-1/2} in probability; see Theorem 5 (iii). It is therefore reasonable to use the estimated FPC scores as explanatory variables in the GMM-k step, computed as ξ̂_ik = λ̂_k^{-1/2} ∫ {h_i(t) - m̂(t)} ψ̂_k(t) dt, with λ̂_k and ψ̂_k the estimated eigenvalues and eigenfunctions of the covariance estimator. We also use the original vectors and the FPC scores computed by standard PCA as explanatory variables for comparison. To evaluate whether the resulting clusters indeed capture the domains, we measure the Clustering Purity, a well-known metric for evaluating clustering; see Schütze et al. (2008). Table 1 shows that MLM-based models dominate word2vec and auto-regressive models, mainly because MLM-based models use the entire sentence context when generating the representations, while auto-regressive models only use the past context and word2vec uses a limited window context. Direct modeling on the data without dimensionality reduction gives the worst results, and PCA significantly improves the performance in all cases. The last four columns of Table 1 show that the Purity of the domain-clustering task is not sacrificed and even improves slightly in some cases. Hence, modeling on FPC scores computed from sparsified covariance estimators is efficient and effective. In sum, by applying our sparsification methods and the related covariance estimation, we achieve the same performance as standard PCA while accelerating computation.
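Clustering Purity assigns each predicted cluster its majority true label and reports the fraction of points covered by those majorities; a reference implementation (ours, following the definition in Schütze et al. (2008)):

```python
import numpy as np

def purity(labels_pred, labels_true):
    """Purity = (1/n) * sum over predicted clusters of the majority true-label count."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]   # true labels inside cluster c
        total += np.bincount(members).max()       # size of the majority label
    return total / len(labels_true)
```

Purity is 1 when every predicted cluster is label-pure; note it is not penalized for over-segmentation, which is why it is paired here with a fixed number k of clusters.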

5. CONCLUSIONS AND LIMITATION

In this paper, the RK (RK-SPAT) estimator converges in L² to the averaged sample estimator without sparsification at the rate O_p(n^{-1/2}), and the BS (BS-SPAT) estimator converges uniformly at the rate O_p(n^{-1/2}). We further characterize the weak convergence of the corresponding estimates of eigenvalues and eigenfunctions. It is necessary to take the spatial factor into account when correlation across nodes is non-negligible, since the standard approach of averaging sample vectors can then lead to high estimation error, and spline interpolation is carried out to avoid the loss of overall data information. The theoretical results are backed by simulation and application. A few issues still merit further research. The AIC selection method works well in practice, but a stronger theoretical justification for its use is still needed. Our work focuses on approximation and estimation, while in recent years there has been a great deal of work on deriving approximate distributions, which are crucial for global inference. It is also worth extending our sparsification methodology to functional regression models and large-scale longitudinal models, where it is expected to find further applications in various scientific fields. Covariance estimation in such models is a significant challenge and requires more in-depth investigation.

A SPLINE REPRESENTATION

Denote by V_{n,p} the empirical inner product matrix of the B-spline basis {B_{ℓ,p}(t)}_{ℓ=1}^{J_s+p}, i.e., V_{n,p} = (⟨B_{ℓ,p}, B_{ℓ′,p}⟩_d)_{ℓ,ℓ′=1}^{J_s+p} = d^{-1} B^⊤ B. Denote x_i = (x_i(1/d), ..., x_i(d/d))^⊤, m = (m(1/d), ..., m(d/d))^⊤ and Z_i = (Z_i(1/d), ..., Z_i(d/d))^⊤. The form of model (1) ensures that the spline estimator h_i(·) admits the representation h_i(·) = d^{-1} B(·)^⊤ V_{n,p}^{-1} B^⊤ x_i = m̂(·) + Ẑ_i(·), where m̂(·) = d^{-1} B(·)^⊤ V_{n,p}^{-1} B^⊤ m and Ẑ_i(·) = d^{-1} B(·)^⊤ V_{n,p}^{-1} B^⊤ Z_i.

B PROOF OF THEOREM 1

According to the definitions of Ĝ and Ḡ, the MSE can be computed as

E∥Ĝ - Ḡ∥² = Σ_{j,j′=1}^d E| (1/n)(d/J_s)² Σ_{i=1}^n (h_ij - h̄_j)(h_ij′ - h̄_j′) - (1/n) Σ_{i=1}^n (x_ij - m_j)(x_ij′ - m_j′) |².

Now, since E[(d/J_s) h_ij] = x_ij and E[(d/J_s) h_ij′] = x_ij′, it holds that

(1/n²) Σ_{i=1}^n E[ (d/J_s)² (h_ij - h̄_j)(h_ij′ - h̄_j′) - (x_ij - m_j)(x_ij′ - m_j′) ]² = (1/n²) Σ_{i=1}^n [ E{(d/J_s)² (h_ij - h̄_j)(h_ij′ - h̄_j′)}² - E{(x_ij - m_j)(x_ij′ - m_j′)}² ],

where h̄_j stands for the j-th element of the d-dimensional vector h̄. Since h_ij = x_ij with probability J_s/d and h_ij = 0 with probability 1 - J_s/d,

E{(d/J_s)² (h_ij - h̄_j)(h_ij′ - h̄_j′)}² = (d/J_s)⁴ E{(h_ij - h̄_j)(h_ij′ - h̄_j′)}² = (d/J_s)² E{(x_ij - m_j)(x_ij′ - m_j′)}².

Elements in each vector are assumed to be generated independently for RANDOM-SPARSIFICATION, that is, x_ij is independent of x_ij′ for 1 ≤ i ≤ n and j ≠ j′. Hence,

E∥Ĝ - Ḡ∥² = (1/n²) ((d/J_s)² - 1) Σ_{i=1}^n Σ_{j,j′=1}^d E{(x_ij - m_j)(x_ij′ - m_j′)}² = (1/n²) ((d/J_s)² - 1) Σ_{i=1}^n E[ Σ_{j=1}^d (x_ij - m_j)² Σ_{j′=1}^d (x_ij′ - m_j′)² ] = (1/n²) ((d/J_s)² - 1) R₁,

where R₁ = Σ_{i=1}^n ∥x_i - m∥⁴. Moreover, according to Assumption 6, n^{-1}((d/J_s)² - 1) ≍ d^{-θ} (d/(d^γ C_d))² → 0. The boundedness of R₁ then gives E∥Ĝ - Ḡ∥² = O(n^{-1}), and consequently ∥Ĝ - Ḡ∥ = O_p(n^{-1/2}).

C PROOF OF PROPOSITION 1

The trick is to introduce a random variable ξ_ij to aid in the computation of the conditional expectations E_{M_j | M_j ≥ 1}(·). Let ξ_ij be an indicator taking values in {0, 1} according to whether h_ij = x_ij or not, for 1 ≤ i ≤ n, 1 ≤ j ≤ d, leading to E_{M_j | M_j ≥ 1, M_j′ ≥ 1}(·) = E_{M_j | ξ_ij = 1, ξ_ij′ = 1}(·), E_{M_j | M_j = 0, M_j′ ≥ 1}(·) = E_{M_j | ξ_ij = 0, ξ_ij′ = 1}(·), and E_{M_j | M_j = 0, M_j′ = 0}(·) = E_{M_j | ξ_ij = 0, ξ_ij′ = 0}(·).

Case 1: With probability (1 - J_s/d)², the event {ξ_ij = 0, ξ_ij′ = 0} holds, which implies h_ij = 0 and h_ij′ = 0. Therefore,

E_{M_j, M_j′ | ξ_ij = 0, ξ_ij′ = 0}[ β̂² (h_ij - h̄_j)(h_ij′ - h̄_j′) / (T(M_j) T(M_j′)) ] = E_{M_j | ξ_ij = 0}[ β̂ (h_ij - h̄_j)/T(M_j) ] E_{M_j′ | ξ_ij′ = 0}[ β̂ (h_ij′ - h̄_j′)/T(M_j′) ] = 0.

Case 2: With probability (J_s/d)(1 - J_s/d), the event {ξ_ij = 0, ξ_ij′ = 1} holds, which implies h_ij = 0 and h_ij′ = x_ij′. Still,

E_{M_j, M_j′ | ξ_ij = 0, ξ_ij′ = 1}[ β̂² (h_ij - h̄_j)(h_ij′ - h̄_j′) / (T(M_j) T(M_j′)) ] = 0.

Case 3: With probability (J_s/d)², the event {ξ_ij = 1, ξ_ij′ = 1} holds, which implies h_ij = x_ij and h_ij′ = x_ij′. Therefore,

E_{M_j, M_j′ | ξ_ij = 1, ξ_ij′ = 1}[ β̂² (h_ij - h̄_j)(h_ij′ - h̄_j′) / (T(M_j) T(M_j′)) ] = E_{M_j, M_j′ | M_j ≥ 1, M_j′ ≥ 1}[ β̂² (h_ij - h̄_j)(h_ij′ - h̄_j′) / (T(M_j) T(M_j′)) ] = β̂² (x_ij - m_j)(x_ij′ - m_j′) E_{M_j, M_j′ | M_j ≥ 1, M_j′ ≥ 1}[ 1 / (T(M_j) T(M_j′)) ].

A crucial observation is that ξ_ij = 1 only implies M_j ≥ 1 and gives no other information about M_j. Taking the expectation with respect to ξ_ij and ξ_ij′, we have

E_{ξ_ij, ξ_ij′} E_{M_j, M_j′ | ξ_ij, ξ_ij′}[ β̂² (h_ij - h̄_j)(h_ij′ - h̄_j′) / (T(M_j) T(M_j′)) ] = (J_s/d)² β̂² (x_ij - m_j)(x_ij′ - m_j′) E_{M_j, M_j′ | M_j ≥ 1, M_j′ ≥ 1}[ 1 / (T(M_j) T(M_j′)) ] = (x_ij - m_j)(x_ij′ - m_j′).

This proves Proposition 1.
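The scaling identity underlying this proof can be verified exactly for a small n by enumerating all 2^n sampling patterns of one coordinate; with β̂ defined by (3), E[β̂ h_1j / T(M_j)] recovers x_1j exactly (the values and the AVG choice of T below are illustrative).

```python
import numpy as np
from math import comb
from itertools import product

n, p = 5, 0.4                                 # p = Js/d
x = np.array([1.3, -0.7, 2.0, 0.5, -1.1])     # coordinate j across the n nodes
T = lambda r: np.sqrt(1.0 + (n / 2.0) * ((r - 1) / (n - 1)) ** 2)   # AVG choice

# beta^{-1} = sum_r (p / T(r)) C(n-1, r-1) p^{r-1} (1-p)^{n-r}, eq. (3)
inv_beta = sum(p / T(r) * comb(n - 1, r - 1) * p ** (r - 1) * (1 - p) ** (n - r)
               for r in range(1, n + 1))
beta = 1.0 / inv_beta

expect = 0.0
for mask in product([0, 1], repeat=n):        # enumerate every sampling pattern
    prob = np.prod([p if s else 1 - p for s in mask])
    M = sum(mask)                             # M_j for this pattern
    if mask[0]:                               # h_1j = x_1j only if node 1 sampled j
        expect += prob * beta * x[0] / T(M)
# expect equals x[0] up to floating-point error, mirroring Case 3 of the proof
```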

D PROOF OF THEOREM 2

According to the definition, the (j, j′)-th element of Ĝ^SPAT is

Ĝ^SPAT_{jj′} = (1/n) (β̂² / (T(M_j) T(M_j′))) Σ_{i=1}^n (h_ij - h̄_j)(h_ij′ - h̄_j′).

The MSE can be computed as

E∥Ĝ^SPAT - Ḡ∥² = Σ_{j,j′=1}^d E( Ĝ^SPAT_{jj′} - Ḡ_{jj′} )² = Σ_{j,j′=1}^d E[ (1/n)(β̂²/(T(M_j)T(M_j′))) Σ_{i=1}^n (h_ij - h̄_j)(h_ij′ - h̄_j′) - (1/n) Σ_{i=1}^n (x_ij - m_j)(x_ij′ - m_j′) ]².   (5)

As the estimator is designed to be unbiased, i.e., E[(1/n)(β̂²/(T(M_j)T(M_j′))) Σ_i (h_ij - h̄_j)(h_ij′ - h̄_j′)] = (1/n) Σ_i (x_ij - m_j)(x_ij′ - m_j′), it holds that

E[ (1/n)(β̂²/(T(M_j)T(M_j′))) Σ_i (h_ij - h̄_j)(h_ij′ - h̄_j′) - (1/n) Σ_i (x_ij - m_j)(x_ij′ - m_j′) ]² = (1/n²) E[ (β̂²/(T(M_j)T(M_j′))) Σ_i (h_ij - h̄_j)(h_ij′ - h̄_j′) ]² - (1/n²) [ Σ_i (x_ij - m_j)(x_ij′ - m_j′) ]².   (6)

We now analyze the first term above:

E[ (β̂²/(T(M_j)T(M_j′))) Σ_i (h_ij - h̄_j)(h_ij′ - h̄_j′) ]² = Σ_{i=1}^n β̂⁴ E[ (h_ij - h̄_j)²(h_ij′ - h̄_j′)² / (T(M_j)² T(M_j′)²) ] + 2 Σ_{i=1}^n Σ_{k=i+1}^n β̂⁴ E[ (h_ij - h̄_j)(h_kj - h̄_j)(h_ij′ - h̄_j′)(h_kj′ - h̄_j′) / (T(M_j)² T(M_j′)²) ].   (7)

Note that the expectation is taken over the randomness in h_ij as well as in T(M_j). Further, β̂⁴ (h_ij - h̄_j)²(h_ij′ - h̄_j′)²/(T(M_j)²T(M_j′)²) is non-zero only when node i samples coordinates j and j′, i.e., h_ij = x_ij and h_ij′ = x_ij′, which implies M_j ≥ 1 and M_j′ ≥ 1. Therefore, by the law of total expectation,

β̂⁴ E[ (h_ij - h̄_j)²(h_ij′ - h̄_j′)² / (T(M_j)²T(M_j′)²) ] = β̂⁴ E_{M_j | M_j ≥ 1}[ J_s (x_ij - m_j)² / (d T(M_j)²) ] E_{M_j′ | M_j′ ≥ 1}[ J_s (x_ij′ - m_j′)² / (d T(M_j′)²) ] = [ β̂⁴ Σ_{r,r′=1}^n (J_s/(d T(r)²)) (J_s/(d T(r′)²)) C(n-1, r-1) C(n-1, r′-1) (J_s/d)^{r+r′-2} (1 - J_s/d)^{2n-r-r′} ] (x_ij - m_j)² (x_ij′ - m_j′)² = (d/J_s + c₁)² (x_ij - m_j)² (x_ij′ - m_j′)²,   (8)

where c₁ is defined in Theorem 2. Following a similar argument, (h_ij - h̄_j)(h_kj - h̄_j)(h_ij′ - h̄_j′)(h_kj′ - h̄_j′)/(T(M_j)²T(M_j′)²) is non-zero only when nodes i and k both sample coordinates j and j′, i.e., h_ij = x_ij, h_kj = x_kj, h_ij′ = x_ij′, h_kj′ = x_kj′, which implies M_j ≥ 2 and M_j′ ≥ 2. Therefore, by the law of total expectation,

β̂⁴ E[ (h_ij - h̄_j)(h_kj - h̄_j)(h_ij′ - h̄_j′)(h_kj′ - h̄_j′) / (T(M_j)²T(M_j′)²) ] = β̂⁴ E_{M_j | M_j ≥ 2}[ J_s² (x_ij - m_j)(x_kj - m_j) / (d² T(M_j)²) ] E_{M_j′ | M_j′ ≥ 2}[ J_s² (x_ij′ - m_j′)(x_kj′ - m_j′) / (d² T(M_j′)²) ] = [ β̂⁴ Σ_{r,r′=2}^n (J_s²/(d² T(r)²)) (J_s²/(d² T(r′)²)) C(n-2, r-2) C(n-2, r′-2) (J_s/d)^{r+r′-4} (1 - J_s/d)^{2n-r-r′} ] (x_ij - m_j)(x_kj - m_j)(x_ij′ - m_j′)(x_kj′ - m_j′) = (1 - c₂)² (x_ij - m_j)(x_kj - m_j)(x_ij′ - m_j′)(x_kj′ - m_j′),   (9)

where c₂ is defined in Theorem 2. Substituting (8) and (9) into (7), we get

E[ (β̂²/(T(M_j)T(M_j′))) Σ_i (h_ij - h̄_j)(h_ij′ - h̄_j′) ]² = (d/J_s + c₁)² Σ_{i=1}^n (x_ij - m_j)²(x_ij′ - m_j′)² + 2 (1 - c₂)² Σ_{i=1}^n Σ_{k=i+1}^n (x_ij - m_j)(x_kj - m_j)(x_ij′ - m_j′)(x_kj′ - m_j′).   (10)

Now, substituting (10) into (6), we get

(1/n²)((d/J_s + c₁)² - 1) Σ_{i=1}^n (x_ij - m_j)²(x_ij′ - m_j′)² + (2/n²)((1 - c₂)² - 1) Σ_{i=1}^n Σ_{k=i+1}^n (x_ij - m_j)(x_kj - m_j)(x_ij′ - m_j′)(x_kj′ - m_j′).   (11)

Finally, substituting (11) into (5) and summing over j, j′, we get

E∥Ĝ^SPAT - Ḡ∥² = (1/n²)((d/J_s + c₁)² - 1) R₁ + (1/n²)((1 - c₂)² - 1) R₂,

where R₁ = Σ_{i=1}^n ∥x_i - m∥⁴ and R₂ = 2 Σ_{i=1}^n Σ_{k=i+1}^n ⟨x_i - m, x_k - m⟩². Moreover, according to Assumption 6 and the boundedness of c₁, n^{-1}((d/J_s + c₁)² - 1) ≍ d^{-θ}(d/(d^γ C_d))² + d^{-θ}(d/(d^γ C_d)) → 0. The boundedness of R₁ then gives n^{-2}((d/J_s + c₁)² - 1) R₁ = O(n^{-1}), and the boundedness of c₂ and R₂ ensures n^{-2}((1 - c₂)² - 1) R₂ = O(n^{-2}). Therefore, E∥Ĝ^SPAT - Ḡ∥² = O(n^{-1}) and thus ∥Ĝ^SPAT - Ḡ∥ = O_p(n^{-1/2}).

E PROOF OF THEOREM 3

Note that R₁ and R₂ are completely determined by the original data, while c₁ and c₂ depend on the function T(·). Observing the result in Theorem 2, the only term that depends on T(·) is (d/J_s + c₁)² R₁ + (1 - c₂)² R₂, since c₁ and c₂ are computed from T(·). Thus, to find the function T*(·) that minimizes the MSE, we only need to minimize this term. From the definitions of c₁ and c₂ in Theorem 2, we obtain

T*(·) = argmin_T { β̂⁴ [ Σ_{r=1}^n (J_s/(d T(r)²)) C(n-1, r-1) (J_s/d)^{r-1} (1 - J_s/d)^{n-r} ]² + (R₂/R₁) β̂⁴ [ Σ_{r=2}^n (J_s²/(d² T(r)²)) C(n-2, r-2) (J_s/d)^{r-2} (1 - J_s/d)^{n-r} ]² }.   (12)

We claim that T*(r) = (1 + (R₂/R₁)((r-1)/(n-1))²)^{1/2} is an optimal solution of (12). To see this, write p = J_s/d and consider the following cases.

Case 1: p = 0 or p = 1. In this case c₁ and c₂ are independent of T(·), and hence the objective does not depend on the choice of T(·).

Case 2: 0 < p < 1. Define

w* = argmin_w (w^⊤ A w) / (b^⊤ w)²,   (13)

where w is the n-dimensional vector with entries w_r = 1/T(r)², b is the vector with entries b_r = [C(n-1, r-1) p^{r-1} (1 - p)^{n-r}]², and A is the diagonal matrix with entries

A_rr = [C(n-1, r-1) p^{r-1} (1 - p)^{n-r}]² + (R₂/R₁) p² [C(n-2, r-2) p^{r-2} (1 - p)^{n-r}]² = b_r ( 1 + (R₂/R₁) ((r-1)/(n-1))² ).

Note that A_rr > 0 for all r ∈ {1, 2, ..., n}, which implies that w ↦ A^{1/2} w is a one-to-one mapping. Therefore, setting z = A^{1/2} w, the objective in (13) reduces to

z* = argmin_z ∥z∥² / (b^⊤ A^{-1/2} z)².   (14)

Observe that the objectives (13) and (14) are invariant to the scale of T(·), w and z, respectively, and thus the solutions are unique only up to a scaling factor. Therefore, in the case of (14), it is sufficient to solve the reduced objective z* = argmin_{z, ∥z∥=1} ∥z∥² / (b^⊤ A^{-1/2} z)² = argmin_{z, ∥z∥=1} 1 / (b^⊤ A^{-1/2} z)², which is minimized by z* = A^{-1/2} b / ∥A^{-1/2} b∥.
Therefore, the optimal solution (up to a constant) is $w^*=A^{-1/2}\frac{A^{-1/2}b}{\|A^{-1/2}b\|}$, i.e., $w^*\propto A^{-1}b$. Correspondingly, we conclude that
$$T^*(r)=(w_r^*)^{-1/2}=\left(\frac{A_{rr}}{b_r}\right)^{1/2}=\left\{1+\frac{R_2}{R_1}\left(\frac{r-1}{n-1}\right)^2\right\}^{1/2}$$
minimizes (12), and consequently minimizes the MSE of the RK-SPAT estimator.

F PROOF OF THEOREM 4

PROOF OF (I) We first provide several technical lemmas.

Lemma 1. Let $W_i\sim N(0,\sigma_i^2)$, $\sigma_i>0$, $i\in\{1,2,\ldots,n\}$. Then for $n>2$, $a>2$,
$$P\!\left(\max_{1\le i\le n}|W_i/\sigma_i|>a\sqrt{\log n}\right)<\sqrt{\tfrac{2}{\pi}}\,n^{1-a^2/2}.$$
Hence, $\left(\max_{1\le i\le n}|W_i|\right)/\left(\max_{1\le i\le n}\sigma_i\right)\le\max_{1\le i\le n}|W_i/\sigma_i|=O_{a.s.}(\sqrt{\log n})$.

PROOF. Note that for $n>2$, $a>2$,
$$P\!\left(\max_{1\le i\le n}\left|\frac{W_i}{\sigma_i}\right|>a\sqrt{\log n}\right)\le\sum_{i=1}^nP\!\left(\left|\frac{W_i}{\sigma_i}\right|>a\sqrt{\log n}\right)\le2n\{1-\Phi(a\sqrt{\log n})\}<2n\,\frac{\phi(a\sqrt{\log n})}{a\sqrt{\log n}}\le2n\,\phi(a\sqrt{\log n})=\sqrt{\tfrac{2}{\pi}}\,n^{1-a^2/2},$$
where $\phi(x)$ denotes the standard normal density at $x$ and $\Phi(x)$ denotes the corresponding cumulative distribution function. Lemma 1 follows by applying the Borel–Cantelli lemma.

Lemma 2. As $n\to\infty$, we have
$$\max_{1\le i\le n}\|h_i-x_i\|_\infty=O_{a.s.}\!\left(J_s^{-p^*}(n\log n)^{2/r_0}\right)=O_{a.s.}(n^{-1/2}).$$

PROOF. The trajectory $x_i(t)$ can be written as $x_i(t)=m(t)+\sum_{k=1}^\infty\xi_{ik}\phi_k(t)$. Denote $\boldsymbol\phi_k=(\phi_k(1/d),\ldots,\phi_k(d/d))^\top$, and let $\tilde\phi_k(t)=d^{-1}B(t)^\top V_{n,p}^{-1}B^\top\boldsymbol\phi_k$ be the B-spline smoothing of $\phi_k(t)$. The linearity of spline smoothing implies that
$$h_i(t)-x_i(t)=\tilde m(t)-m(t)+\sum_{k=1}^\infty\xi_{ik}\{\tilde\phi_k(t)-\phi_k(t)\}.$$
Lemma A.4 in Cao et al. (2012) assures that there exists a constant $C_{q,\mu}>0$ such that
$$\|\tilde m-m\|_\infty\le C_{q,\mu}\|m\|_{q,\mu}J_s^{-p^*},\tag{15}$$
$$\|\tilde\phi_k-\phi_k\|_\infty\le C_{q,\mu}\|\phi_k\|_{q,\mu}J_s^{-p^*},\quad k\ge1.\tag{16}$$
Thus, with the norm inequality, we have
$$\|h_i-x_i\|_\infty\le\|\tilde m-m\|_\infty+\sum_{k=1}^\infty|\xi_{ik}|\,\|\tilde\phi_k-\phi_k\|_\infty\le C_{q,\mu}W_iJ_s^{-p^*},$$
where $W_i=\|m\|_{q,\mu}+\sum_{k=1}^\infty|\xi_{ik}|\,\|\phi_k\|_{q,\mu}$ are i.i.d. nonnegative random variables.
$W_i^{r_0}$ has a finite absolute moment, so by Markov's inequality,
$$P\!\left(\max_{1\le i\le n}W_i>(n\log n)^{2/r_0}\right)\le n\,\frac{EW_i^{r_0}}{(n\log n)^2}=EW_i^{r_0}\,n^{-1}(\log n)^{-2},$$
which implies
$$\sum_{n=2}^\infty P\!\left(\max_{1\le i\le n}W_i>(n\log n)^{2/r_0}\right)\le EW_i^{r_0}\sum_{n=2}^\infty n^{-1}(\log n)^{-2}<+\infty.$$
According to the Borel–Cantelli lemma, we conclude that $\max_{1\le i\le n}W_i=O_{a.s.}\{(n\log n)^{2/r_0}\}$, which together with (15) and (16) proves the result
$$\max_{1\le i\le n}\|h_i-x_i\|_\infty=O_{a.s.}\!\left(J_s^{-p^*}(n\log n)^{2/r_0}\right).$$
Moreover, Assumption 6 assures that
$$n^{1/2}J_s^{-p^*}(n\log n)^{2/r_0}\asymp d^{\theta/2}(d^\gamma C_d)^{-p^*}\left(d^\theta\log d^\theta\right)^{2/r_0}=d^{\theta/2-\gamma p^*+2\theta/r_0}O(\log d)\to0.\tag{17}$$
Therefore, $\max_{1\le i\le n}\|h_i-x_i\|_\infty=O_{a.s.}(n^{-1/2})$.

According to the definitions of $\hat m(\cdot)$ and $\bar m(\cdot)$, $\hat m(\cdot)-\bar m(\cdot)$ can be decomposed as $\hat m(\cdot)-\bar m(\cdot)=n^{-1}\sum_{i=1}^n\{h_i(\cdot)-x_i(\cdot)\}$. Lemma 2 further tells us that
$$\sup_{t\in[0,1]}n^{1/2}|\hat m(t)-\bar m(t)|\le n^{1/2}\max_{1\le i\le n}\|h_i-x_i\|_\infty=o_{a.s.}(1).$$
Hence, $\|\hat m-\bar m\|_\infty=O_{a.s.}(n^{-1/2})$ is proved, where $\hat m$ is the BS estimator. Denote $\hat Z_i(\cdot)=h_i(\cdot)-\hat m(\cdot)$, $\bar Z_i(\cdot)=x_i(\cdot)-\bar m(\cdot)$ and $Z_i(\cdot)=x_i(\cdot)-m(\cdot)$; we further obtain the following lemma.

Lemma 3. As $n\to\infty$,
$$\max_{1\le i\le n}\|\hat Z_i-Z_i\|_\infty=O_{a.s.}\!\left(J_s^{-p^*}(n\log n)^{2/r_0}\right).$$

PROOF. Denote $\tilde\phi_k(t)=d^{-1}B(t)^\top V_{n,p}^{-1}B^\top\boldsymbol\phi_k$ and, within this proof, $\hat Z_i(t)=\sum_{k=1}^\infty\xi_{ik}\tilde\phi_k(t)$, so that
$$\hat Z_i(t)-Z_i(t)=\sum_{k=1}^\infty\xi_{ik}\{\tilde\phi_k(t)-\phi_k(t)\}.$$
From (16) and Assumption 5,
$$\|\hat Z_i-Z_i\|_\infty\le\sum_{k=1}^\infty|\xi_{ik}|\,\|\tilde\phi_k-\phi_k\|_\infty\le CW_iJ_s^{-p^*},$$
where $W_i=\sum_{k=1}^\infty|\xi_{ik}|\,\|\phi_k\|_{q,\mu}$ are i.i.d. nonnegative random variables with finite absolute moment. Then
$$P\!\left(\max_{1\le i\le n}W_i>(n\log n)^{2/r_0}\right)\le n\,\frac{EW_i^{r_0}}{(n\log n)^2}=EW_i^{r_0}\,n^{-1}(\log n)^{-2},$$
thus $\sum_{n=2}^\infty P(\max_{1\le i\le n}W_i>(n\log n)^{2/r_0})\le EW_i^{r_0}\sum_{n=2}^\infty n^{-1}(\log n)^{-2}<+\infty$, so the Borel–Cantelli lemma tells us that $\max_{1\le i\le n}W_i=O_{a.s.}\{(n\log n)^{2/r_0}\}$. Lemma 3 is then obtained.

Next we compute the convergence rate of the BS covariance estimators.
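The Gaussian tail bound used in the proof of Lemma 1 can be checked deterministically with the standard library's complementary error function; the values of $n$ and $a$ below are arbitrary choices satisfying $n>2$, $a>2$.

```python
import math

def gauss_tail(t):
    # P(Z > t) for a standard normal Z, via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2))

for n in [3, 10, 100, 10_000, 1_000_000]:
    for a in [2.1, 3.0, 5.0]:
        t = a * math.sqrt(math.log(n))
        union_bound = 2 * n * gauss_tail(t)                 # n-term union bound
        lemma_bound = math.sqrt(2 / math.pi) * n ** (1 - a**2 / 2)
        assert union_bound <= lemma_bound
```

The chain of inequalities in the proof requires $a\sqrt{\log n}\ge1$ (for Mill's ratio and for dropping the $1/(a\sqrt{\log n})$ factor), which holds for all pairs tested above.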
For any $t,t'\in[0,1]$, we decompose $\hat G(t,t')-\bar G(t,t')$ into three parts:
$$\hat G(t,t')-\bar G(t,t')=n^{-1}\sum_{i=1}^n\hat Z_i(t)\hat Z_i(t')-n^{-1}\sum_{i=1}^n\bar Z_i(t)\bar Z_i(t')$$
$$=n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-\bar Z_i(t)\}\{\hat Z_i(t')-\bar Z_i(t')\}+n^{-1}\sum_{i=1}^n\bar Z_i(t')\{\hat Z_i(t)-\bar Z_i(t)\}+n^{-1}\sum_{i=1}^n\bar Z_i(t)\{\hat Z_i(t')-\bar Z_i(t')\}.\tag{18}$$
According to the decompositions of $\{h_i\}_{i=1}^n$ and $\{x_i\}_{i=1}^n$, that is, $h_i(t)=\hat Z_i(t)+\hat m(t)$ and $x_i(t)=\bar Z_i(t)+\bar m(t)$, we get $h_i(t)-x_i(t)=\hat Z_i(t)-\bar Z_i(t)+\hat m(t)-\bar m(t)$. Noting that $\hat m(t)=n^{-1}\sum_{i'=1}^nh_{i'}(t)$ and $\bar m(t)=n^{-1}\sum_{i'=1}^nx_{i'}(t)$, $\hat Z_i(t)-\bar Z_i(t)$ can be represented as
$$\hat Z_i(t)-\bar Z_i(t)=\{h_i(t)-\hat m(t)\}-\{x_i(t)-\bar m(t)\}=h_i(t)-x_i(t)-n^{-1}\sum_{i'=1}^n\{h_{i'}(t)-x_{i'}(t)\}.$$
Therefore, we obtain that
$$\hat Z_i(t)-\bar Z_i(t)=\{\hat Z_i(t)-Z_i(t)\}-n^{-1}\sum_{i'=1}^n\{\hat Z_{i'}(t)-Z_{i'}(t)\}.$$
Hence,
$$n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-\bar Z_i(t)\}\{\hat Z_i(t')-\bar Z_i(t')\}=n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-Z_i(t)\}\{\hat Z_i(t')-Z_i(t')\}-\left[n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-Z_i(t)\}\right]\left[n^{-1}\sum_{i=1}^n\{\hat Z_i(t')-Z_i(t')\}\right].$$
According to Lemma 3, it is easy to obtain that
$$\left|n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-Z_i(t)\}\right|\le\max_{1\le i\le n}\|\hat Z_i-Z_i\|_\infty=O_{a.s.}\!\left(J_s^{-p^*}(n\log n)^{2/r_0}\right)=O_{a.s.}(n^{-1/2}),$$
where the last equality holds by (17). And
$$\left|n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-Z_i(t)\}\{\hat Z_i(t')-Z_i(t')\}\right|\le\left(\max_{1\le i\le n}\|\hat Z_i-Z_i\|_\infty\right)^2=O_{a.s.}\!\left(J_s^{-2p^*}(n\log n)^{4/r_0}\right)=O_{a.s.}(n^{-1/2}),$$
where the last equality follows from Assumption 6: as $d\to\infty$,
$$n^{1/2}J_s^{-2p^*}(n\log n)^{4/r_0}\asymp d^{\theta/2}(d^\gamma C_d)^{-2p^*}(d^\theta\log d^\theta)^{4/r_0}=d^{\theta/2-2\gamma p^*+4\theta/r_0}O(\log d)\to0.$$
Hence,
$$\sup_{t,t'\in[0,1]}\left|n^{-1}\sum_{i=1}^n\{\hat Z_i(t)-\bar Z_i(t)\}\{\hat Z_i(t')-\bar Z_i(t')\}\right|=O_{a.s.}(n^{-1/2}).\tag{19}$$
Moreover,
$$n^{-1}\sum_{i=1}^n\bar Z_i(t')\{\hat Z_i(t)-\bar Z_i(t)\}=n^{-1}\sum_{i=1}^n\left\{Z_i(t')-n^{-1}\sum_{i'=1}^nZ_{i'}(t')\right\}\left[\{\hat Z_i(t)-Z_i(t)\}-n^{-1}\sum_{i'=1}^n\{\hat Z_{i'}(t)-Z_{i'}(t)\}\right]$$
$$=n^{-1}\sum_{i=1}^nZ_i(t')\{\hat Z_i(t)-Z_i(t)\}-n^{-2}\sum_{i=1}^nZ_i(t')\sum_{i'=1}^n\{\hat Z_{i'}(t)-Z_{i'}(t)\}.\tag{20}$$
Noting that $n^{-1}\sum_{i=1}^n\|Z_i\|_\infty=O_p(1)$, since $E\big\|n^{-1}\sum_{i=1}^nZ_i\big\|_\infty\le\sum_{k=1}^\infty\|\phi_k\|_\infty E|\bar\xi_{\cdot k}|<\infty$, where $\bar\xi_{\cdot k}=n^{-1}\sum_{i=1}^n\xi_{ik}$, we then have
$$\left|n^{-2}\sum_{i=1}^nZ_i(t')\sum_{i'=1}^n\{\hat Z_{i'}(t)-Z_{i'}(t)\}\right|\le\max_{1\le i'\le n}\|\hat Z_{i'}-Z_{i'}\|_\infty\cdot n^{-1}\sum_{i=1}^n\|Z_i\|_\infty=O_p(n^{-1/2}),$$
$$\left|n^{-1}\sum_{i=1}^nZ_i(t')\{\hat Z_i(t)-Z_i(t)\}\right|\le\max_{1\le i\le n}\|\hat Z_i-Z_i\|_\infty\cdot n^{-1}\sum_{i=1}^n\|Z_i\|_\infty=O_p(n^{-1/2}).\tag{21}$$
Substituting (21) into (20), we conclude that
$$\sup_{t,t'\in[0,1]}\left|n^{-1}\sum_{i=1}^n\bar Z_i(t')\{\hat Z_i(t)-\bar Z_i(t)\}\right|=O_p(n^{-1/2}).\tag{22}$$
Similarly, we have
$$\sup_{t,t'\in[0,1]}\left|n^{-1}\sum_{i=1}^n\bar Z_i(t)\{\hat Z_i(t')-\bar Z_i(t')\}\right|=O_p(n^{-1/2}).\tag{23}$$
Substituting (19), (22) and (23) into (18), we have $\sup_{t,t'\in[0,1]}|\hat G(t,t')-\bar G(t,t')|=O_p(n^{-1/2})$. Then $\|\hat G-\bar G\|_\infty=O_p(n^{-1/2})$ is proved, where $\hat G$ is the BS estimator.

PROOF OF (II) Next we prove the conclusion for the BS-SPATIAL mean and covariance estimators. The estimation error of the spatial mean can be computed as
$$\|\hat m^{\mathrm{SPAT}}-\bar m\|_\infty=\max_{1\le j\le d}\left|\frac1n\frac{\hat\beta}{T(M_j)}\sum_{i=1}^nh_{ij}-\frac1n\sum_{i=1}^nx_{ij}\right|=\frac1n\max_{1\le j\le d}\left|\frac{\hat\beta}{T(M_j)}\sum_{i=1}^nh_{ij}-\frac{\hat\beta}{T(M_j)}\sum_{i=1}^nx_{ij}+\frac{\hat\beta}{T(M_j)}\sum_{i=1}^nx_{ij}-\sum_{i=1}^nx_{ij}\right|$$
$$\le\left|\frac{\hat\beta}{T(M_j)}\right|\|\hat m-\bar m\|_\infty+\left|\frac{\hat\beta}{T(M_j)}-1\right|\|\bar m\|_\infty=O_p(n^{-1/2}),$$
where the last equality holds by noticing that $\hat\beta/T(M_j)\to_p1$ from the law of large numbers and, from Theorem 4(i), $\|\hat m-\bar m\|_\infty=O_{a.s.}(n^{-1/2})$, where $\hat m$ is the BS mean estimator. The estimation error of the spatial covariance can be computed as
$$\|\hat G^{\mathrm{SPAT}}-\bar G\|_\infty=\max_{1\le j,j'\le d}\left|\hat G^{\mathrm{SPAT}}_{jj'}-\bar G_{jj'}\right|=\frac1n\max_{1\le j,j'\le d}\left|\frac{\hat\beta^2}{T(M_j)T(M_{j'})}\sum_{i=1}^n(h_{ij}-\bar h_j)(h_{ij'}-\bar h_{j'})-\sum_{i=1}^n(x_{ij}-\bar m_j)(x_{ij'}-\bar m_{j'})\right|$$
$$\le\frac1n\max_{1\le j,j'\le d}\left|\frac{\hat\beta^2}{T(M_j)T(M_{j'})}\sum_{i=1}^n(h_{ij}-\bar h_j)(h_{ij'}-\bar h_{j'})-\frac{\hat\beta^2}{T(M_j)T(M_{j'})}\sum_{i=1}^n(x_{ij}-\bar m_j)(x_{ij'}-\bar m_{j'})\right|+\frac1n\max_{1\le j,j'\le d}\left|\frac{\hat\beta^2}{T(M_j)T(M_{j'})}\sum_{i=1}^n(x_{ij}-\bar m_j)(x_{ij'}-\bar m_{j'})-\sum_{i=1}^n(x_{ij}-\bar m_j)(x_{ij'}-\bar m_{j'})\right|$$
$$\le\max_{1\le j,j'\le d}\left|\frac{\hat\beta^2}{T(M_j)T(M_{j'})}\right|\left|\hat G_{jj'}-\bar G_{jj'}\right|+\max_{1\le j,j'\le d}\left|\frac{\hat\beta^2}{T(M_j)T(M_{j'})}-1\right|\left|\bar G_{jj'}\right|\le O_p(1)\,\|\hat G-\bar G\|_\infty+O_p(n^{-1/2})\,\|\bar G\|_\infty=O_p(n^{-1/2}),$$
where the last equality holds by noticing that $\hat\beta^2/\{T(M_j)T(M_{j'})\}\to_p1$ from the law of large numbers and, from Theorem 4(i), $\|\hat G-\bar G\|_\infty=O_p(n^{-1/2})$, where $\hat G$ is the BS covariance estimator.
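The three-term decomposition (18) is a purely algebraic identity, which can be checked on a grid with arbitrary matrices; here `h` is an arbitrary perturbation of `x`, standing in for the smoothed trajectories.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 15
x = rng.normal(size=(n, d))              # original trajectories on a grid
h = x + 0.1 * rng.normal(size=(n, d))    # stand-in for the smoothed/sparsified h_i

Zh = h - h.mean(axis=0)                  # \hat Z_i = h_i - \hat m
Zx = x - x.mean(axis=0)                  # \bar Z_i = x_i - \bar m
D = Zh - Zx                              # \hat Z_i - \bar Z_i

G_hat = Zh.T @ Zh / n                    # \hat G on the grid
G_bar = Zx.T @ Zx / n                    # \bar G on the grid
decomposition = (D.T @ D + Zx.T @ D + D.T @ Zx) / n

assert np.allclose(G_hat - G_bar, decomposition)
```

The identity holds exactly (up to floating-point error) for any `h` and `x`, which is why the proof only needs rates for the three terms separately.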
Consequently, the result $\|\hat G-\bar G\|=O_p(n^{-1/2})$ holds for all four estimators. Moreover, it is easy to obtain that $\|\bar G-G\|=O_p(n^{-1/2})$, where $G$ is the true covariance function, which is usually unknown in applications. In sum,
$$\|\hat G-G\|\le\|\hat G-\bar G\|+\|\bar G-G\|=O_p(n^{-1/2}).$$
It is worth noticing that although the BS and BS-SPAT estimators enjoy a sharper bound on $\|\hat G-\bar G\|$, $\|\hat G-G\|$ still converges at the rate $O_p(n^{-1/2})$, since $\|\bar G-G\|$ is the dominant term.

Denote $\Delta\psi_k(t)=\int(\hat G-G)(t,t')\psi_k(t')\,dt'$. Based on the result $\|\hat G-G\|=O_p(n^{-1/2})$, we have $\|\Delta\psi_k\|=O_p(n^{-1/2})$ for any $k\ge1$. Let
$$\|\Delta\|=\left[\iint\{\hat G(t,t')-G(t,t')\}^2\,dt\,dt'\right]^{1/2}=O_p(n^{-1/2});$$
then according to Hall et al. (2006), we have
$$\hat\psi_k-\psi_k=\sum_{j:j\ne k}(\lambda_k-\lambda_j)^{-1}\langle\Delta\psi_k,\psi_j\rangle\psi_j+O(\|\Delta\|^2).$$
It follows from Bessel's inequality that $\|\hat\psi_k-\psi_k\|\le C\|\Delta\psi_k\|+O(\|\Delta\|^2)=O_p(n^{-1/2})$. Hence, we obtain that $\|\hat\psi_k-\psi_k\|=O_p(n^{-1/2})$.

PROOF OF (II) By (2.9) in Hall et al. (2006) and $\|\hat G-G\|=O_p(n^{-1/2})$, we obtain
$$\hat\lambda_k-\lambda_k=\iint(\hat G-G)(t,t')\psi_k(t)\psi_k(t')\,dt\,dt'+O(\|\Delta\psi_k\|).$$
Hence $\hat\lambda_k-\lambda_k=O_p(n^{-1/2})$ is proved.

PROOF OF (III) According to $\int\{x_i(t)-m(t)\}\phi_k(t)\,dt=\lambda_k\xi_{ik}$, we have
$$\xi_{ik}=\lambda_k^{-1/2}\int\{x_i(t)-m(t)\}\psi_k(t)\,dt.$$
Similarly, $\hat\xi_{ik}=\hat\lambda_k^{-1/2}\int\{h_i(t)-\hat m(t)\}\hat\psi_k(t)\,dt$. For $1\le i\le n$, $\hat\xi_{ik}-\xi_{ik}$ can be divided into two parts, $\hat\xi_{ik}-\xi_{ik}=R_1+R_2$, where
$$R_1=\hat\lambda_k^{-1/2}\int\{h_i(t)-\hat m(t)\}\hat\psi_k(t)\,dt-\hat\lambda_k^{-1/2}\int\{x_i(t)-m(t)\}\psi_k(t)\,dt,$$
$$R_2=\hat\lambda_k^{-1/2}\int\{x_i(t)-m(t)\}\psi_k(t)\,dt-\lambda_k^{-1/2}\int\{x_i(t)-m(t)\}\psi_k(t)\,dt.$$
We assume that for $k\in\mathbb N$, $\lambda_k>0$ and $\hat\lambda_k>0$. Moreover, recall that $\max_{1\le i\le n}\|h_i-x_i\|=O_p(n^{-1/2})$ and $\|\hat\psi_k-\psi_k\|=O_p(n^{-1/2})$. Through a first-order Taylor expansion of $\hat\lambda_k^{-1/2}$ at $\lambda_k$, it is easy to obtain that
$$\hat\lambda_k^{-1/2}=\lambda_k^{-1/2}-\tfrac12\lambda_k^{-3/2}(\hat\lambda_k-\lambda_k)+o(\hat\lambda_k-\lambda_k).$$
Hence, $\hat\lambda_k-\lambda_k=O_p(n^{-1/2})$ ensures that $\hat\lambda_k^{-1/2}-\lambda_k^{-1/2}=O_p(n^{-1/2})$. Consequently,
$$|R_2|=\left|\hat\lambda_k^{-1/2}-\lambda_k^{-1/2}\right|\left|\int\{x_i(t)-m(t)\}\psi_k(t)\,dt\right|\le\left|\hat\lambda_k^{-1/2}-\lambda_k^{-1/2}\right|\|x_i-m\|\,\|\psi_k\|=O_p(n^{-1/2}),$$
where $\|x_i-m\|=O_p(1)$.
Therefore, $\max_{1\le i\le n}|\hat\xi_{ik}-\xi_{ik}|\le\max_{1\le i\le n}(|R_1|+|R_2|)=O_p(n^{-1/2})$.
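To illustrate Theorem 5 empirically, the following sketch simulates functional data from a two-component Karhunen–Loève expansion and recovers eigenvalues, eigenfunctions, and FPC scores from the gridded sample covariance. The sine basis, eigenvalues, and grid size are illustrative choices, not the paper's simulation design; note the grid-to-integral rescaling by $1/d$ and the sign ambiguity of estimated eigenfunctions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 100
t = np.arange(1, d + 1) / d
lam = np.array([1.0, 0.25])                        # true eigenvalues (illustrative)
psi = np.vstack([np.sqrt(2) * np.sin(np.pi * t),   # orthonormal eigenfunctions
                 np.sqrt(2) * np.sin(2 * np.pi * t)])
m = np.sin(2 * np.pi * t)                          # mean function
xi = rng.normal(size=(n, 2))                       # FPC scores: mean 0, variance 1
x = m + (xi * np.sqrt(lam)) @ psi                  # x_i = m + sum_k sqrt(lam_k) xi_ik psi_k

Z = x - x.mean(axis=0)
G = Z.T @ Z / n                                    # sample covariance on the grid
evals, evecs = np.linalg.eigh(G)                   # ascending eigenvalues
lam_hat = evals[::-1][:2] / d                      # rescale: grid sums approximate integrals
psi_hat = evecs[:, ::-1][:, :2].T * np.sqrt(d)

assert np.allclose(lam_hat, lam, atol=0.15)

# estimated scores: xi_hat = lam_hat^{-1/2} * integral of (x_i - m_hat) psi_hat
xi_hat = (Z @ psi_hat.T / d) / np.sqrt(lam_hat)
c = np.corrcoef(xi[:, 0], xi_hat[:, 0])[0, 1]      # |c| near 1, up to a sign flip
```

With $n=2000$ the estimated eigenvalues land within sampling error of the truth, matching the $O_p(n^{-1/2})$ rates above.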

H MORE NUMERICAL STUDY

To visualize the covariance function of the functional data generated from model (4) in the simulation study, Figure 4 shows the true covariance function, the sample covariance estimator computed from the original data without sparsification, and the four proposed covariance estimators computed from sparsified vectors. The deviations of the RK and RK-SPAT estimators are large on the diagonal t = t′, and the smoothness of their surfaces is poor. The accuracy of the BS estimator is significantly improved, with only slight deviation at the boundary points. Building on the RK estimator, the RK-SPAT estimator further accounts for the spatial factor, so that the estimation accuracy at the boundary points is improved, but the smoothness of the surface is sacrificed to some extent.



The code is attached to the supplementary material and will be made publicly available upon acceptance. The superscript SPAT distinguishes whether a statistic accounts for the spatial factor. "d: Random-knots" means AMSE(Ĝ) as d varies, where Ĝ is the RK estimator; "n: Random-knots-Spatial" means AMSE(Ĝ) as n varies, where Ĝ is the RK-SPAT estimator. Other curves are defined similarly.



The standard process $x(\cdot)$ allows the Karhunen–Loève $L^2$ representation $x(\cdot)=m(\cdot)+\sum_{k=1}^\infty\xi_k\phi_k(\cdot)$, in which the random coefficients $\xi_k$, called functional principal component (FPC) scores, are uncorrelated with mean 0 and variance 1. The rescaled eigenfunctions $\phi_k$, called FPCs, satisfy $\phi_k=\sqrt{\lambda_k}\psi_k$ and $\int\{x(t)-m(t)\}\phi_k(t)\,dt=\lambda_k\xi_k$ for $k\ge1$. Although the sequences $\{\lambda_k\}_{k=1}^\infty$ and $\{\phi_k\}_{k=1}^\infty$ exist mathematically, they are either unknown or unobservable.
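The identity $\int\{x(t)-m(t)\}\phi_k(t)\,dt=\lambda_k\xi_k$ can be verified numerically for a truncated expansion; the sine basis and eigenvalues below are illustrative choices, and the integral is approximated by a Riemann sum on the grid $j/d$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 2000
t = np.arange(1, d + 1) / d
lam = np.array([1.0, 0.5, 0.25])                  # eigenvalues (illustrative)
psi = np.vstack([np.sqrt(2) * np.sin((k + 1) * np.pi * t) for k in range(3)])
phi = np.sqrt(lam)[:, None] * psi                 # FPCs: phi_k = sqrt(lam_k) psi_k
xi = rng.normal(size=3)                           # FPC scores
m = np.cos(np.pi * t)
x = m + xi @ phi                                  # Karhunen-Loeve representation

# check  integral of {x(t) - m(t)} phi_k(t) dt = lam_k xi_k  via a Riemann sum
lhs = ((x - m) * phi).sum(axis=1) / d
assert np.allclose(lhs, lam * xi, atol=1e-2)
```

The agreement is essentially exact here because the sine functions are orthogonal both in $L^2[0,1]$ and on the equally spaced grid.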

RELATED WORK Considerable efforts have been made to analyze the first-order structure of function-valued random elements, i.e., the functional mean m(•). Estimation of the mean function has been investigated in Jhunjhunwala et al. (2021), Garg et al. (2014), Suresh et al. (2017), Mayekar et al. (2021) and Brown et al. (2021). Cao et al. (2012) and Huang et al. (2022) considered empirical mean estimation using B-spline smoothing. The second-order structure of random functions, the covariance function G(•,•), is the next object of interest. To the best of our knowledge, spatial correlation across nodes has not yet been considered in the context of sparsified covariance estimation. Research on sparsification has received wide attention recently, for instance Alistarh et al. (2018) and Stich et al. (2018).

In Figure 2.2(a), on the left are the original vectors $\{x_i\}_{i=1}^n$ and on the right are the sparsified vectors $\{h_i\}_{i=1}^n$, each randomly retaining $J_s$ elements of the original vector, with $n=3$, $d=6$, $J_s=3$. This definition tells us that $P(h_{ij}=0)=1-\frac{J_s}{d}$ and $P(h_{ij}=x_{ij})=\frac{J_s}{d}$. The estimator generated from $\{h_i\}_{i=1}^n$ is called the sparsified estimator, which is obtained by replacing the original trajectory $x_i=(x_{i1},\ldots,x_{id})$
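A minimal sketch of RANDOM-KNOTS sparsification, confirming by Monte Carlo that each coordinate survives with probability $J_s/d$; the uniform without-replacement sampling routine is an assumed implementation detail.

```python
import numpy as np

rng = np.random.default_rng(4)
d, Js, reps = 6, 3, 20_000
x = rng.normal(size=d)                    # one node's original vector
hits = np.zeros(d)
for _ in range(reps):
    keep = rng.choice(d, size=Js, replace=False)   # Js knots chosen uniformly
    h = np.zeros(d)
    h[keep] = x[keep]                     # sparsified vector: kept entries survive
    hits[keep] += 1

# empirical P(h_j = x_j) should be close to Js/d for every coordinate
assert np.allclose(hits / reps, Js / d, atol=0.02)
```

Since only $J_s$ of $d$ coordinates are transmitted per node, naive averaging of the $h_i$ is biased downward by the factor $J_s/d$, which is exactly what the $d/J_s$ rescaling in the estimators corrects.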

Coefficient estimation follows
$$\left(\hat\lambda_{1,p,i},\ldots,\hat\lambda_{J_s+p,p,i}\right)^\top=\arg\min_{(\lambda_{1,p},\ldots,\lambda_{J_s+p,p})\in\mathbb R^{J_s+p}}\sum_{j=1}^d\left\{x_{ij}-\sum_{\ell=1}^{J_s+p}\lambda_{\ell,p}B_{\ell,p}(j/d)\right\}^2.$$
The BS covariance estimator is obtained by replacing $\{h_{ij}\}_{i=1,j=1}^{n,d}$ with the B-spline trajectories. BSPLINE-SPATIAL (BS-SPAT): we replace $\{h_{ij}\}_{i=1,j=1}^{n,d}$ in the RK-SPAT estimator with the B-spline estimates of the trajectories. The BS-SPAT estimator considers not only the correlation among nodes, but also the correlation within a single node. The next theorem states the convergence rate of the BS and BS-SPAT estimators.
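The coefficient estimation above is an ordinary least-squares problem in the B-spline basis. A sketch using a hand-built degree-1 (hat-function) basis rather than the order-$p$ basis of the paper; the knot grid and noise level are illustrative.

```python
import numpy as np

def hat_basis(t, knots):
    """Degree-1 B-splines (hat functions) on an equally spaced knot grid."""
    h = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(t[:, None] - knots[None, :]) / h)

d, J = 200, 10
t = np.arange(1, d + 1) / d
knots = np.linspace(0, 1, J)
B = hat_basis(t, knots)                   # d x J design matrix, entries B_l(j/d)
x = np.sin(2 * np.pi * t) + 0.05 * np.random.default_rng(5).normal(size=d)

coef, *_ = np.linalg.lstsq(B, x, rcond=None)   # least-squares spline coefficients
fit = B @ coef                                 # smoothed trajectory h_i
assert np.mean((fit - np.sin(2 * np.pi * t)) ** 2) < 0.01
```

The fitted `fit` plays the role of the B-spline trajectory substituted into the BS and BS-SPAT estimators.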

[Figure: (a) RK (RK-SPAT) sparsification; (b) BS (BS-SPAT) sparsification.]

Theorem 4 (BS (BS-SPAT) Estimation Error). Under Assumptions 1–6, the BS (BS-SPAT) estimator $\hat m(\cdot)$ is asymptotically equivalent to $\bar m(\cdot)$ up to order $n^{1/2}$, and a similar conclusion holds for the covariance function, i.e., as $n\to\infty$,

Figure 2 reveals that AMSE(λ̂) decreases as n increases, while its change with d is small; AMSE(φ̂) exhibits the same regularity as AMSE(λ̂), in accordance with Theorem 5. Whether or not spatial factors are considered has a greater impact on the accuracy of the eigenvectors than on that of the eigenvalues.

Figure 1: Left, middle: AMSE(Ĝ) as a function of d, n. Right: AMSE(Ḡ) as a function of d, n.

Figure 2: Row 1: AMSE( λ) as a function of d, n. Row 2: AMSE( φ) as a function of d, n.

Figure 3: Eigenfunctions computed from the true covariance and four covariance estimators.

happens with probability $\left(\frac{J_s}{d}\right)^2$, the event $\{\xi_{ij}=0,\ \xi_{ij'}=1\}$ happens with probability $\frac{J_s}{d}$

G PROOF OF THEOREM 5

PROOF OF (I) The proposed RK and RK-SPAT covariance estimators satisfy $\|\hat G-\bar G\|=O_p(n^{-1/2})$, and the proposed BS and BS-SPAT covariance estimators satisfy $\|\hat G-\bar G\|_\infty=O_p(n^{-1/2})$.

$$\le\|x_i-m\|\,\|\hat\psi_k-\psi_k\|+\hat\lambda_k^{-1/2}\left(\|h_i-x_i\|+\|\hat m-m\|\right)\|\psi_k\|=O_p(n^{-1/2}).$$

Figure 4: Plots of true covariance, averaged covariance and four different covariance estimators.

In this paper, $O_p$ (or $o_p$) denotes a sequence of random variables of a certain order in probability, and $O_{a.s.}$ (or $o_{a.s.}$) denotes almost sure $O$ (or $o$). For sequences $a_n$ and $b_n$, we write $a_n\asymp b_n$ if $a_n$ and $b_n$ are asymptotically equivalent. For any Lebesgue measurable function $\phi(x)$ on a domain $\mathcal D$, let $\|\phi\|_\infty=\sup_{x\in\mathcal D}|\phi(x)|$.

, denote the estimator for response $x_{ij}$ by $\hat h_{ij}(J)$. The trajectory estimates depend on the knot selection sequence, which is the sparsified vector for the RK (RK-SPAT) estimator and the B-spline smoothed vector for the BS (BS-SPAT) estimator. Then $\hat J_{s,i}$ for the $i$-th curve is the one minimizing AIC:
$$\hat J_{s,i}=\arg\min_{J\in[1,J_{s^*}]}\mathrm{AIC}(J),$$
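The knot-selection loop can be sketched as follows, assuming a Gaussian-likelihood AIC of the form $\log(\mathrm{RSS}/d)+2(J+p)/d$; the exact penalty used in the paper may differ, and the hat-function basis is an illustrative stand-in for the order-$p$ B-spline basis.

```python
import numpy as np

def hat_basis(t, knots):
    # degree-1 B-splines (hat functions) on an equally spaced knot grid
    h = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(t[:, None] - knots[None, :]) / h)

def aic_select(x, t, J_max, p=1):
    """Pick the number of knots minimizing an assumed Gaussian-likelihood AIC."""
    d = len(x)
    scores = []
    for J in range(2, J_max + 1):
        B = hat_basis(t, np.linspace(0, 1, J))
        coef, *_ = np.linalg.lstsq(B, x, rcond=None)
        rss = np.sum((x - B @ coef) ** 2)
        scores.append(np.log(rss / d) + 2 * (J + p) / d)   # fit term + complexity penalty
    return 2 + int(np.argmin(scores))

d = 300
t = np.arange(1, d + 1) / d
rng = np.random.default_rng(6)
x = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=d)
J_hat = aic_select(x, t, J_max=30)
```

In the paper's procedure this selection is run once per curve, yielding a curve-specific $\hat J_{s,i}$.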

Clustering purity for unsupervised domain clustering. Best results in each row are marked in bold. Explanatory variables for GMM-k are the original data without PCA, standard PCA scores, and FPC scores computed from the RK(-SPAT) & BS(-SPAT) sparsified covariance estimators.
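Clustering purity, as reported in the table, can be computed as follows; this is the standard definition (each cluster votes for its majority true label), implemented as a hypothetical helper rather than the paper's code.

```python
import numpy as np

def purity(cluster_ids, true_labels):
    """Clustering purity: sum of majority-class sizes over clusters, divided by n."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_ids):
        labels_in_c = true_labels[cluster_ids == c]
        total += np.bincount(labels_in_c).max()   # size of the majority class
    return total / len(true_labels)

# a relabeled-but-perfect clustering has purity 1; one misassigned point gives 0.75
assert purity([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
assert purity([0, 0, 0, 1], [1, 1, 0, 0]) == 0.75
```

Purity is invariant to cluster relabeling, which makes it suitable for comparing GMM clusterings built on different feature sets.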

and $\|\hat m-m\|\le\|\hat m-\bar m\|+\|\bar m-m\|=O_p(n^{-1/2})$,

A DECOMPOSITION

For estimation in the FIXED-SPARSIFICATION method, we define the design matrix for B-spline regression as

