FED-COR: FEDERATED CORRELATION TEST WITH SECURE AGGREGATION

Abstract

In this paper, we propose the first federated correlation test framework compatible with secure aggregation, namely FED-COR. In FED-COR, correlation tests are recast as frequency moment estimation problems. To estimate the frequency moments, the clients collaboratively generate a shared projection matrix and then use stable projection to encode the local information in a compact vector. As such encodings can be linearly aggregated, secure aggregation can be applied to conceal the individual updates. We formally establish the security guarantee of FED-COR by proving that only the minimum necessary information (i.e., the correlation statistics) is revealed to the server. The evaluation results show that FED-COR achieves good accuracy with small client-side computation overhead and performs comparably to the centralized correlation test in several real-world case studies.

1. INTRODUCTION

Correlation test, as the name implies, is the process of examining the correlation between two random variables using observational data. It is a fundamental building block in a wide variety of real-world applications, including feature selection (Zheng et al., 2004), cryptanalysis (Nyberg, 2001), causal graph discovery (Spirtes et al., 2000), empirical finance (Ledoit & Wolf, 2008; Kim & Ji, 2015), medical studies (Kassirer, 1983), and genomics (Wilson et al., 1999; Dudoit et al., 2003). The observational data used in correlation tests may contain sensitive information such as genomic records, so collecting participants' information in a central repository poses a significant privacy risk. To address this problem, we adopt the federated setting, where each client maintains its own data and communicates with a central server to compute a function. The communication transcript should contain as little information as feasible to prevent the server from inferring sensitive information. To motivate our work and ease the understanding of the problem setting, consider a medical company that wants to study the correlation between genetic defects and races using patients' private data from several hospitals. In a traditional federated method, the server, which is the medical company, aggregates the hospitals' local private contingency tables using secure aggregation (Bonawitz et al., 2017; Bell et al., 2020). The company can then conduct correlation tests on the aggregated global contingency table without directly accessing the individual hospitals' private data. Attentive readers might notice that this method leaks the joint distribution, i.e., the whole global contingency table, to the server. The joint distribution may contain sensitive information, and leaking it will likely violate privacy regulations.
For instance, the medical company can observe the genetic distribution across races from the global table. Secure aggregation primarily supports linear aggregation, whereas correlation tests require computing a summed p-th moment over the aggregated data, where p ∈ (0, 1) ∪ (1, 2]. Thus, the joint distribution is leaked if we directly apply secure aggregation. To bridge the gap between secure aggregation and federated correlation tests, we take an important step towards designing non-linear secure aggregation protocols. Specifically, we design a federated protocol framework, namely FED-COR, optimized for a class of correlation tests such as the χ²-test and the G-test. FED-COR is designed to have low computation and communication costs and to disclose only information that is much less sensitive than the joint distribution. Our first insight is to recast correlation tests as frequency moment estimation problems. To approximate the frequency moments in a federated manner, each client collaborates with the other clients to generate a projection matrix and encodes its raw data into a low-dimensional vector via stable random projection (Indyk, 2006; Vempala, 2005; Li, 2008). Such encodings can be aggregated with only summation, allowing clients to leverage secure aggregation. The server then decodes the aggregated encoding to approximate the frequency moments. As secure aggregation conceals each client's individual update within the aggregated global update, the server learns only the information necessary for the correlation test. To illustrate the power of FED-COR, we instantiate it with a representative correlation test, Pearson's χ²-test (Pearson, 1900), and refer to the concrete protocol as FED-χ². We evaluate FED-χ² on 4 synthetic datasets and 16 real-world datasets. The results show that FED-χ² can replace centralized correlation tests with good accuracy.
Compared to the traditional method with secure aggregation mentioned above, FED-χ² saves a factor of O(m) communication cost per client, where m is the size of the contingency table. In FED-χ², clients only need to upload a low-dimensional encoding of size ℓ ≪ m, while in the traditional method the clients upload the complete contingency tables. Additionally, we analyze FED-χ² in two real-world use cases: feature selection and online false discovery rate control. The results show that FED-χ² achieves comparable performance with centralized correlation tests and can withstand up to 20% of clients dropping out with only minor influence on the accuracy. Besides Pearson's χ²-test, we also demonstrate how to accommodate other commonly used correlation tests, such as the G-test, in FED-COR. In summary, we make the following contributions:

• We propose FED-COR, the first secure federated correlation test framework. FED-COR is computation- and communication-efficient and leaks much less information than directly using secure aggregation to collect the contingency table, which completely leaks the joint distribution.

• FED-COR decomposes correlation tests into frequency moment estimations that can easily be encoded/decoded using stable projection and secure aggregation techniques. We provide a formal security proof and utility analysis of the protocol.

• We demonstrate how to accommodate the χ²-test and G-test in FED-COR, and empirically evaluate FED-χ² in several real-world use cases. The findings suggest that FED-χ² can substitute the centralized χ²-test with comparable accuracy. Besides, FED-χ² can tolerate up to 20% of clients dropping out with a minor accuracy drop. We provide the code in the supplementary material for results verification.

2. RELATED WORK

There has been a line of work studying secure federated learning and statistics. Bonawitz et al. (2017) proposed the widely cited secure aggregation protocol as a low-cost way to securely compute linear functions in a federated setting. It has seen many variants and improvements since then. For instance, Truex et al. (2019) and Xu et al. (2019) employed advanced cryptographic tools for secure aggregation, such as threshold homomorphic encryption and functional encryption. So et al. (2021) proposed TURBOAGG, which combines secret sharing with erasure codes for better dropout tolerance. To improve communication efficiency, Bell et al. (2020) and Choi et al. (2020) replaced the complete graph in secure aggregation with either a sparse random graph or a low-degree graph. Secure aggregation is deployed in a variety of applications. Agarwal et al. (2018) added binomial noise to local gradients, achieving both differential privacy and communication efficiency. Wang et al. (2020) replaced the binomial noise with discrete Gaussian noise, which is shown to exhibit better composability. Kairouz et al. (2021) proved that the sum of discrete Gaussians is close to a discrete Gaussian, thereby discarding the common random seed assumption of Wang et al. (2020). The above three works all incorporate secure aggregation in their protocols to lower the noise scale required for differential privacy. Chen et al. (2020) added an extra public parameter to each client to force them to train in the same way, allowing for the detection of malicious clients during aggregation. Nevertheless, designing secure federated correlation tests, despite their importance in real-world scenarios, has not been explored by existing research in this field.

3. METHODOLOGY

In this section, we elaborate on the design of FED-COR. Sec. 3.1 formalizes the problem, establishes the notation system, and introduces the threat model. In Sec. 3.2, we detail the design of FED-COR by instantiating it with Pearson's χ²-test, yielding FED-χ². In Secs. 3.3 and 3.4, we present the security proof, utility analysis, and communication and computation analyses of FED-χ².

3.1. PROBLEM SETUP

We now formulate the problem of the federated correlation test and establish the notation system. We use $[n]$ to denote $\{1, \dots, n\}$. We denote vectors with bold lower-case letters (e.g., a, b, c) and matrices with bold upper-case letters (e.g., A, B, C). For ease of presentation, we use the example from Sec. 1 to introduce the notation. A medical company is studying the correlation between genetic defects (denoted by variable $X$) and race (denoted by variable $Y$). The support domain of $X$ (or $Y$) is denoted by $\mathcal{X}$ (or $\mathcal{Y}$). In the example, $\mathcal{X} = \{\text{yes}, \text{no}\}$, representing whether the participant has the genetic defect, and $\mathcal{Y}$ is the set of all races. We denote the size of $\mathcal{X}$ by $m_x$ and the size of $\mathcal{Y}$ by $m_y$. The company wants to use the patient records from $n$ hospitals to conduct the research. Concretely, each hospital holds a 2-dimensional local contingency table $D_i = \{x \in \mathcal{X}, y \in \mathcal{Y} : v^{(i)}_{xy} \in \{0\} \cup [M]\}$, where $x$ is the row label, $y$ is the column label, and $v^{(i)}_{xy}$ is the number of patients with label $(x, y)$. We use $m = m_x m_y$ to denote the size of the contingency table. The first step of the traditional method in the federated setting is to collect all the hospitals' contingency tables on a centralized server $S$ of the company and sum them to obtain the global contingency table $D = \{x, y : v_{xy} = \sum_{i \in [n]} v^{(i)}_{xy}\}$. The total number of samples with row label $x$ (or column label $y$) is defined as $v_x = \sum_{y \in \mathcal{Y}} v_{xy}$ (or $v_y = \sum_{x \in \mathcal{X}} v_{xy}$). The total number of samples observed is $v = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} v_{xy}$. The next step is to calculate a test statistic $s(D)$ on the global table. For Pearson's χ²-test, the statistic is

$$s_{\chi^2}(D) := \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{(v_{xy} - \hat{v}_{xy})^2}{\hat{v}_{xy}},$$

where $\hat{v}_{xy} = \frac{v_x \times v_y}{v}$ is the expectation of $v_{xy}$ if $X$ and $Y$ are uncorrelated. The statistic is then compared with a threshold to decide whether $X$ and $Y$ are correlated.
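To make this baseline concrete, the centralized statistic above can be computed directly from the per-client contingency tables. The following is a minimal numpy sketch; the function name and toy data are our own illustration, not part of the protocol:

```python
import numpy as np

def chi2_statistic(tables):
    """Pearson chi-square statistic computed the 'traditional' way:
    sum the per-client contingency tables into the global table D,
    derive the marginals, and compare observed vs. expected counts."""
    D = np.sum(tables, axis=0)              # global table: D[x, y] = v_xy
    v_x = D.sum(axis=1, keepdims=True)      # row marginals v_x
    v_y = D.sum(axis=0, keepdims=True)      # column marginals v_y
    v = D.sum()                             # total number of samples
    expected = v_x * v_y / v                # \hat{v}_xy under independence
    return float(np.sum((D - expected) ** 2 / expected))

# toy example: a single client whose 2x2 table is perfectly correlated
s = chi2_statistic(np.array([[[10, 0], [0, 10]]], dtype=float))
print(s)  # 20.0
```

In the traditional federated baseline, the sum on the first line of the function is exactly what secure aggregation would compute across clients, which is why the whole table D is revealed to the server.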
Attentive readers might notice that the method described above raises severe privacy issues: the patient records from different hospitals are collected on a centralized server of the company, which probably violates the corresponding privacy regulations. In this work, our aim is to design a secure federated correlation test protocol that leaks only non-sensitive information and has low computation/communication cost. Concretely, we trade accuracy for security, as long as the estimation error is small with high probability. Formally, if FED-COR outputs $\hat{s}$ and the corresponding standard centralized correlation test outputs $s$, the following accuracy requirement should be satisfied for a small multiplicative error bound $\epsilon$ and a small failure probability $\delta$:

$$\Pr[(1 - \epsilon)s \le \hat{s} \le (1 + \epsilon)s] \ge 1 - \delta \quad (2)$$

Threat Model. We assume that the centralized server $S$ is honest-but-curious. It honestly follows the protocol due to regulatory or reputational pressure but is curious to discover extra private information from clients' legitimate updates for profit or surveillance purposes. As a result, client updates should contain as little sensitive information as feasible. On the other hand, we assume the clients (e.g., the hospitals) are honest and do not collude with the server. Specifically, we do not consider client-side adversarial attacks (e.g., data poisoning attacks (Bagdasaryan et al., 2020; Bhagoji et al., 2019)). However, we allow a small portion of clients to drop out during the execution. We also provide further security analysis for the case where the server colludes with a client in Appendix G. More importantly, we assume that the marginal distributions of the variables are not sensitive while the joint distribution is. The above example is a natural case where this assumption holds: the aggregated marginal distributions of the genetic defects and the races do not leak sensitive information.
However, the correlation between a specific pair of race and genetic defect can easily be observed if the joint distribution, i.e., the aggregated global contingency table, is obtained by the server.

3.2. FEDERATED CORRELATION TEST WITH SECURE AGGREGATION

In this section, we introduce the design of FED-COR in detail by instantiating it with Pearson's χ²-test. We also discuss how the design generalizes to other statistical tests, such as the G-test (Sokal et al., 1995), in Sec. 5.

From Federated Correlation Test to Frequency Moments Estimation. The α-th frequency moment of a key-value stream is formally defined as follows:

Definition 1 (α-th frequency moment). Given a key-value stream $S = \{a_t \in A, b_t \in B\}_{t \in [T]}$, the α-th frequency moment of $S$ is defined as:

$$F_\alpha(S) := \sum_{a \in A} \Big( \sum_{t \in [T]: a_t = a} b_t \Big)^\alpha \quad (3)$$

We observe that the test statistics of many correlation tests can be rewritten as frequency moments. For example, the statistic of the χ²-test can be reformulated as a second frequency moment:

$$s_{\chi^2}(D) = \sum_{x,y} \frac{(v_{xy} - \hat{v}_{xy})^2}{\hat{v}_{xy}} = \sum_{x,y} \Big( \frac{v_{xy} - \hat{v}_{xy}}{\sqrt{\hat{v}_{xy}}} \Big)^2$$

In the federated setting, the i-th client calculates the vector

$$u_i(x, y) := \frac{v^{(i)}_{xy} - \hat{v}_{xy}/n}{\sqrt{\hat{v}_{xy}}}, \quad (5)$$

and the above formula can be rewritten as a second frequency moment estimation problem:

$$s_{\chi^2}(D) = \sum_{x,y} \Big( \frac{v_{xy} - \hat{v}_{xy}}{\sqrt{\hat{v}_{xy}}} \Big)^2 = \sum_{x,y} \Big( \sum_{i \in [n]} u_i(x, y) \Big)^2$$

Federated Frequency Moments Estimation. Now that we have reformulated the problem, the second step is to design the messages transmitted in FED-COR for α-th frequency moment estimation. We choose stable projection (Indyk, 2006; Vempala, 2005) to encode the client-side information and the geometric mean estimator (Li, 2008) to decode the aggregated message. Before we dive into the details, let us refresh some preliminaries; see Appendix A for more details on stable distributions.

Definition 2 (Symmetric α-stable distribution). A random variable $X$ follows a symmetric α-stable distribution $Q_{\alpha,\beta,F}$ if its characteristic function is:

$$\phi_X(t) = \exp\big(-F|t|^\alpha (1 - \sqrt{-1}\,\beta\,\mathrm{sgn}(t)\tan(\tfrac{\pi\alpha}{2}))\big),$$

where $F$ is the scale, $\alpha \in (0, 2]$ is the stability parameter, and $\beta$ is the skewness. The α-stable distribution is named after its property called α-stability.
Briefly, the sum of independent α-stable variables still follows an α-stable distribution with a different scale.

Definition 3 (α-stability). If random variables $X \sim Q_{\alpha,\beta,1}$ and $Y \sim Q_{\alpha,\beta,1}$ are independent, then $C_1 X + C_2 Y \sim Q_{\alpha,\beta,C_1^\alpha + C_2^\alpha}$.

Inspired by the idea of Indyk's well-cited paper (Indyk, 2006), we encode the frequency moments in the scale parameter of a stable distribution. To encode the information contained in the local contingency table $D_i$, the i-th client collaborates with the other clients to generate a projection matrix $P \in \mathbb{R}^{\ell \times m}$, where ℓ is the encoding size. The components of $P$ are drawn independently from the α-stable distribution $Q_{\alpha,0,1}$. The client then calculates $u_i$ as defined in Eq. 5 and applies the projection to get the encoding $e_i := P \times u_i$ (lines 1-2 in Alg. 1). To decode, the server first sums the encodings from all the clients: $e := \sum_{i \in [n]} e_i$. According to the α-stability in Definition 3, every component $e_k$, $k \in [\ell]$, of the encoding vector $e$ follows the stable distribution $Q_{\alpha,0,s(D)}$. Thus, the statistic of the correlation test can be estimated from the scale of this distribution. We estimate the scale using the unbiased geometric mean estimator (Li, 2008) (lines 3-4 in Alg. 1). A significant advantage of stable projection is that the encodings are linearly aggregatable and thus compatible with secure aggregation. Secure aggregation reveals only the aggregated encoding to the server and greatly reduces the privacy leakage. Furthermore, in Sec. 3.4, we show that a small encoding size suffices to accurately approximate the frequency moments with high probability and can potentially reduce the communication cost under certain setups.

Algorithm 1: The encoding and decoding scheme (Indyk, 2006) for federated frequency moments estimation. Note that the encoding and decoding themselves do not provide any security guarantee.
Function ENCODE(P, u_i):
    return P × u_i
Function DECODE(e):    // ℓ is the encoding size
    return ( ∏_{k=1}^{ℓ} |e_k|^{2/ℓ} ) / ( (2/π) Γ(2/ℓ) Γ(1 − 1/ℓ) sin(π/ℓ) )^ℓ

Algorithm 2: The complete FED-χ² protocol. SECUREAGG is a remote procedure that receives inputs from the clients and returns their summation to the server. INITSECUREAGG is the corresponding setup protocol deciding the communication graph and other hyper-parameters.

Round 1: Reveal the marginal statistics
    INITSECUREAGG(n)    // n: number of clients
    for x ∈ [m_x] do: v_x = SECUREAGG({v_x^(i)}_{i∈[n]})
    for y ∈ [m_y] do: v_y = SECUREAGG({v_y^(i)}_{i∈[n]})
    Server: calculate v = Σ_x v_x
Round 2: Estimate the test statistic
    Each client i: calculate u_i; sample the projection matrix P from Q_{2,0,1}^{ℓ×m} using the common random seed r; calculate e_i = ENCODE(P, u_i)
    e = SECUREAGG(QUANTIZE({e_i}_{i∈[n]}))
    Server: ŝ_{χ²} = DECODE(e)

FED-χ² Protocol. We instantiate FED-COR with Pearson's χ²-test; the complete FED-χ² protocol is presented in Alg. 2. First, the marginal statistics v_x, v_y, and v are collected with secure aggregation and broadcast to all the clients (lines 1-6 of Alg. 2). This step can be omitted if the marginal statistics are already known. The i-th client calculates u_i (lines 9-10 of Alg. 2), samples a random seed r_i, and broadcasts it to the other clients (line 11 of Alg. 2). The clients then receive the random seeds and sample the projection matrix P from the α-stable distribution Q_{2,0,1}^{ℓ×m} using the common random seed r (lines 12-13 of Alg. 2). The i-th client projects u_i to obtain the encoding e_i (line 14 of Alg. 2). Then, the encodings are quantized and aggregated with secure aggregation (line 15 of Alg. 2). As the marginal statistics are already known from the first round, the quantization bound can be set accordingly. Additionally, we can use high precision for quantization, such as 64 bits, so that the precision of the quantized numbers is comparable to or even better than that of IEEE floating-point numbers.
We validate this conjecture with empirical evaluation and hence ignore the effect of quantization on accuracy in the analysis. In the last step, the server obtains the χ²-test statistic using the decoding algorithm described in Alg. 1 (line 17 of Alg. 2).

Remark: Client Dropout. Attentive readers might ask: what if some clients drop out during the protocol execution? Dropouts in the first round have no effect on the test's accuracy as long as the secure aggregation used is resilient to dropout, such as (Bonawitz et al., 2017; Bell et al., 2020). On the other hand, dropouts in the second round will affect the accuracy of the test. However, since the χ² value is typically far from the decision threshold, FED-χ² is intrinsically robust to a small portion of clients dropping out (see Sec. 4 for an empirical assessment).

Remark: The Selection of Secure Aggregation. As introduced in Sec. 2, there is a variety of secure aggregation protocols for different setups (Bonawitz et al., 2017; Truex et al., 2019; Xu et al., 2019; So et al., 2021; Bell et al., 2020; Choi et al., 2020). In the rest of the paper, we choose the state-of-the-art cross-device secure aggregation protocol by Bell et al. (2020) due to its simple trust assumption and low communication cost. We emphasize that FED-COR can incorporate any secure aggregation protocol as needed.
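The full pipeline of Alg. 2 (marginals, the per-client vectors u_i, stable projection with a shared matrix, summation, and geometric mean decoding) can be sketched in a few lines of numpy. This is a plain simulation for intuition only: the sums stand in for SECUREAGG, quantization is omitted, the seed handling is simplified, and all data are synthetic:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, mx, my, l = 20, 5, 4, 4000
tables = rng.integers(0, 30, size=(n, mx, my)).astype(float)

# Round 1: marginal statistics (computed by SECUREAGG in the real protocol)
D = tables.sum(axis=0)
v_x, v_y, v = D.sum(axis=1), D.sum(axis=0), D.sum()
expected = np.outer(v_x, v_y) / v                       # \hat{v}_xy

# Round 2: each client computes u_i = (v_xy^(i) - \hat{v}_xy/n) / sqrt(\hat{v}_xy)
U = ((tables - expected / n) / np.sqrt(expected)).reshape(n, -1)

# Shared projection matrix from the common seed; Q_{2,0,1} is the Gaussian
# with variance 2 under the characteristic-function convention exp(-F|t|^2)
P = rng.normal(0.0, math.sqrt(2.0), size=(l, mx * my))
e = (P @ U.T).sum(axis=1)          # aggregated encoding (SECUREAGG output)

# Geometric mean decoder (Li, 2008) for alpha = 2, computed in log space
bias = ((2 / math.pi) * math.gamma(2 / l) * math.gamma(1 - 1 / l)
        * math.sin(math.pi / l)) ** l
s_hat = math.exp((2 / l) * float(np.log(np.abs(e)).sum())) / bias

s_true = float(np.sum((D - expected) ** 2 / expected))  # centralized chi2
```

With ℓ = 4000 the multiplicative error of `s_hat` is typically a few percent; shrinking ℓ trades accuracy for communication, in line with Theorem 2.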

3.3. SECURITY ANALYSIS

We now prove the security enforced by Alg. 2 via a standard simulation-based proof (Lindell, 2017) on the basis of Theorem 1.

Theorem 1 (Security). Let Π be an instantiation of Alg. 2 with the secure aggregation protocol in Alg. 4 of Appendix B with cryptographic security parameter λ. There exists a PPT simulator SIM such that for all clients C, the number of clients n, all the marginal distributions v_x, v_y, and the aggregated encoding e, the output of SIM is indistinguishable from the view of the real server Π_C in that execution, i.e., Π_C ≈_λ SIM(e, v_x, v_y, n).

Intuitively, Theorem 1 states that no information about the clients beyond the aggregated updates is revealed to the centralized server. Note that this is the minimal information necessary for the server to estimate the test statistic. The complete proof of Theorem 1 is deferred to Appendix D. To further emphasize the privacy protection of our protocol, we also analyze the leakage when the server colludes with a client in Appendix G. We show that even if such collusion happens, our protocol still hides the information in a subspace with exponentially many possible distributions, which practically enforces privacy given the considerably large size of the solution space.

3.4. UTILITY, COMMUNICATION & COMPUTATION ANALYSIS

We first present the utility analysis of FED-χ² in Alg. 2. We show that the output of FED-χ², $\hat{s}_{\chi^2}$, is an accurate approximation (parameterized by ϵ) of the centralized correlation test output $s_{\chi^2}$ with high probability (parameterized by δ) when ℓ is appropriately chosen. The proof is deferred to Appendix E.

Theorem 2 (Utility). Let Π be an instantiation of Alg. 2 with the secure aggregation protocol in Alg. 4 of Appendix B, parameterized with $\ell = c\,\epsilon^{-2}\log(1/\delta)$ for some constant $c$. After executing $\Pi_C$ on all clients $C$, the server yields $\hat{s}_{\chi^2}$, whose distance to the exact correlation test output $s_{\chi^2}$ is bounded with high probability as follows:

$$\Pr[\hat{s}_{\chi^2} < (1 - \epsilon)s_{\chi^2} \ \vee\ \hat{s}_{\chi^2} > (1 + \epsilon)s_{\chi^2}] \le \delta \quad (7)$$

We then present the communication and computation costs of Alg. 2.
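Theorem 2 gives a direct recipe for choosing the encoding size. A small helper might look like the following, where the constant c is not pinned down by the theorem, so c = 1.0 is an explicit placeholder to be calibrated empirically:

```python
import math

def encoding_size(eps, delta, c=1.0):
    """Encoding size from Theorem 2: l = c * eps^(-2) * log(1/delta).
    The constant c is unspecified in the theorem; c=1.0 is a placeholder."""
    return math.ceil(c * math.log(1.0 / delta) / eps ** 2)

# e.g. a 10% multiplicative error with 1% failure probability (c = 1)
l = encoding_size(eps=0.1, delta=0.01)
print(l)  # 461
```

The quadratic dependence on 1/ϵ dominates: halving the target error quadruples the encoding size, while tightening δ costs only a logarithmic factor.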

4. EVALUATION

Experiment Setup. To assess FED-χ²'s accuracy, we simulate it on four synthetic datasets and 12 real-world datasets. We compare the multiplicative error $\epsilon := |\hat{s}_{\chi^2}(D) - s_{\chi^2}(D)| / s_{\chi^2}(D)$ and the power of FED-χ² with those of the standard centralized χ²-test. The four synthetic datasets are independent, linearly correlated, quadratically correlated, and logistically correlated. For the real-world datasets, we report the details in Appendix H. We evaluate FED-χ²'s utility in two real-world application scenarios: feature selection and online false discovery rate (FDR) control. For feature selection, we report the accuracy of a model trained on the selected features. For online FDR control, we report the average false discovery rate. We compare the performance of FED-χ² with that of the centralized χ²-test in each of the three experiments. For secure aggregation, we discretize all real numbers to 64-bit fixed-point numbers. We provide a further evaluation of the influence of the finite field size in Appendix M, which shows that FED-χ² is numerically stable under different finite field sizes. Unless otherwise specified, experiments are launched on an Ubuntu 18.04 LTS server equipped with 32 AMD Opteron(TM) 6212 processors and 512GB RAM.

4.1. EVALUATION RESULTS

Accuracy. We begin by evaluating the accuracy of FED-χ², as illustrated in Fig. 1. Each point represents the mean of 100 independent runs with 100 clients, and the error bars indicate the standard deviation. We choose m_x = m_y = 20 in this experiment. Note that the accuracy drop of FED-χ² is independent of the number of clients. From Fig. 1, we observe that the larger the encoding size ℓ, the smaller the multiplicative error. When ℓ = 50, the multiplicative error is ϵ ≈ 0.2. This conforms with Theorem 2, in which the multiplicative error $\epsilon = \sqrt{c\log(2/\delta)/\ell}$ decreases as ℓ increases. We also evaluate the power (Cohen, 2013) of FED-χ². We set the p-value threshold to 0.05. From the dashed lines in Fig. 1, we can tell that the power of FED-χ² is high. This conforms with our observation on the multiplicative errors: since the χ² values are typically far from the decision threshold, a multiplicative error of 0.2 rarely flips the final decision. We also present the results when 5% of clients drop out in the second round of FED-χ² in Fig. 1. The results show that FED-χ² is robust to a small portion of dropouts. In Appendix J, we present the results for 10%, 15%, and 20% dropout rates. These results further show that FED-χ² can tolerate a considerable portion of client dropouts in Round 2 of Alg. 2.

Client-side Computation Overhead. To assess the extra computation overhead incurred by FED-χ² on the client side, we measure the execution time of the encoding scheme on an Android 10 mobile device equipped with a Snapdragon 865 CPU and 12GB RAM. We use PyDroid (Sandeep Nandal, 2020) to run the client-side computation of FED-χ² on the Android device. As shown in Fig. 2, each point represents the average of 100 separate runs, with accompanying error bars. The overhead is generally negligible. For example, for a 500 × 500 contingency table, the encoding takes less than 30ms.
The overhead grows linearly in m_x (and in m_y), and consequently quadratically in Fig. 2, where m_x = m_y.

4.2. DOWNSTREAM USE CASE STUDY

Feature Selection. Our first case study explores secure federated feature selection using FED-χ². In this setting, each client holds data with a large feature space and wants to collaborate with other clients to rule out unimportant features and retain the features with the top-k highest χ² scores. We use Reuters-21578 (Hayes & Weinstein, 1990), a standard text categorization dataset (Yang, 1999; Yang & Pedersen, 1997; Zhang & Yang, 2003), and pick the top-20 most frequent categories, using 17,262 training and 4,316 test documents. These documents are distributed randomly to 100 clients, each of whom receives the same number of training documents. After removing all numbers and stop-words, we obtain 167,135 indexing terms. The contingency table is of size 2×20, where 2 corresponds to whether a term occurs in an article and 20 corresponds to the number of article categories. After performing feature selection using FED-χ², we select the top 40,000 terms with the highest χ² scores. Compared with the centralized χ²-test, 38,012 (95.03%) of the selected terms are identical, indicating that FED-χ² produces highly consistent results with the standard χ²-test. We then train logistic regression models using the terms selected by FED-χ² and the centralized χ²-test, respectively, with identical hyper-parameters. The details of these models are reported in Appendix I. The training and testing splits are the same for FED-χ², the centralized χ²-test, and the model without feature selection (i.e., 17,262 training and 4,316 test documents). We use the same learning rate, random seed, and all other settings to make the comparison fair. The results are shown in Fig. 3; the models are all trained on an NVIDIA GeForce RTX 3090. The results in Fig. 3 further demonstrate that FED-χ² exhibits comparable performance with the centralized χ²-test.
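The per-term scoring in this case study reduces to a 2 × C contingency table per term. A hypothetical numpy sketch of the centralized scoring step follows (toy data of our own choosing; in the federated variant each score would be obtained via FED-χ² instead of computed directly):

```python
import numpy as np

def chi2_scores(X, y, num_classes):
    """Score each term by the chi-square statistic of its 2 x C table,
    where row 0/1 is term absence/occurrence and columns are classes."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        table = np.zeros((2, num_classes))
        for occurs, label in zip(X[:, j], y):
            table[int(occurs), int(label)] += 1.0
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        mask = expected > 0                       # skip empty rows/columns
        scores[j] = np.sum((table - expected)[mask] ** 2 / expected[mask])
    return scores

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))            # 200 docs, 30 binary terms
y = rng.integers(0, 4, size=200)                  # 4 article categories
top_k = np.argsort(chi2_scores(X, y, 4))[::-1][:10]   # keep the 10 best terms
```

A term perfectly aligned with one category receives a far higher score than a random term, which is what makes the top-k cutoff meaningful.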
When 10% and 20% of clients drop out in the second round of FED-χ², the accuracy of the model trained on the features selected by FED-χ² does not drop much. We also examine performance without feature selection; as expected, model accuracy is significantly higher after feature selection. Note that the model without feature selection has 2,542,700 more parameters than the model with feature selection. Hence, feature selection effectively improves model accuracy while reducing model size and computational cost. We also provide a further evaluation of the influence of the encoding size ℓ in Appendix L, which shows that FED-χ² achieves comparable performance with the centralized χ²-test under different ℓ.

Online False Discovery Rate Control. In the second case study, we explore federated online false discovery rate (FDR) control (Foster & Stine, 2008) with FED-χ². In an online FDR control problem, a data analyst receives a stream of hypotheses on the database, or equivalently, a stream of p-values: p_1, p_2, .... At each time t, the data analyst picks a threshold α_t and rejects the hypothesis when p_t < α_t. The error metric is the false discovery rate, and the objective of online FDR control is to ensure that, for any time t, the FDR up to time t is smaller than a pre-determined quantity. We use the SAFFRON procedure (Ramdas et al., 2018), the state-of-the-art online FDR control method, for multiple hypothesis testing. The χ² results and corresponding p-values are calculated by FED-χ². We present the SAFFRON algorithm in Appendix C. At each time step, there are 100 independent hypotheses, each of which is independent or correlated with probability 0.5. The time sequence length is 100, and the number of clients is 10. The data are synthesized from a multivariate Gaussian distribution. For the correlated data, the covariance matrix is randomly sampled from a uniform distribution.
For the independent data, the covariance matrix is diagonal, and its entries are randomly sampled from a uniform distribution. At time t, we use FED-χ² to calculate the p-values p_t of all the hypotheses, and then use the SAFFRON procedure to estimate the rejection threshold α_t from p_t. The relationship between the average FDR and the encoding size ℓ is shown in Fig. 4. We observe that the variance across independent runs is very small, so we omit the error bars. FED-χ² achieves good performance (FDR lower than 10%) when the encoding size ℓ is larger than 200. In Fig. 4, we also provide the FDR result of the centralized χ²-test as well as the true discovery rate (TDR, i.e., #correct rejections / #should-be rejections). In addition, we provide statistics for each evaluated encoding size ℓ in Appendix K. The results indicate that by increasing the encoding size ℓ, FED-χ² can achieve comparable performance to the centralized χ²-test. The results further demonstrate that FED-χ² can be employed in practice to facilitate online FDR control.

5. DISCUSSION: CORRELATION TESTS BEYOND χ²-TEST

Pearson's χ²-test is not the only correlation test compatible with FED-COR. To demonstrate the extensibility of FED-COR, we show how to recast the G-test (Sokal et al., 1995) as a frequency moments estimation problem. The reduction is more involved, as the G-test statistic contains a logarithmic term; we rewrite $s_G$ as follows:

$$s_G(D) = 2\sum_{x,y} v_{xy} \log \frac{v_{xy}}{\hat{v}_{xy}} = 2\sum_{x,y} v_{xy} \log v_{xy} - 2\sum_{x,y} v_{xy} \log \hat{v}_{xy}$$

As in the χ²-test, $\hat{v}_{xy} = \frac{v_x \times v_y}{v}$ is the expectation of $v_{xy}$ if $X$ and $Y$ are uncorrelated. The first term can be approximated using the following formula (Zhao et al., 2007) with small Δ:

$$\sum_{x,y} v_{xy} \log v_{xy} \approx \frac{1}{2\Delta}\Big(\sum_{x,y} v_{xy}^{1+\Delta} - \sum_{x,y} v_{xy}^{1-\Delta}\Big)$$

In this way, we recast the G-test as two frequency moments estimations of orders 1 + Δ and 1 − Δ. The rest of the protocol is the same as FED-χ² in Alg. 2, except that we estimate two frequency moments.
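The moment-based approximation of the entropy-like term Σ v_xy log v_xy is easy to sanity-check numerically. A small sketch, using toy nonzero counts of our own choosing:

```python
import numpy as np

def entropy_term_via_moments(v, delta):
    """Approximate sum(v * log v) by two frequency moments of orders
    1 + delta and 1 - delta (Zhao et al., 2007):
    (F_{1+delta} - F_{1-delta}) / (2 * delta)."""
    return (np.sum(v ** (1 + delta)) - np.sum(v ** (1 - delta))) / (2 * delta)

v = np.array([12.0, 7.0, 30.0, 5.0, 101.0])     # toy nonzero counts v_xy
exact = float(np.sum(v * np.log(v)))
approx = float(entropy_term_via_moments(v, delta=1e-4))
```

Since v^(1±Δ) = v·e^(±Δ log v), the per-entry error is O(Δ² v (log v)³), so Δ trades approximation error against the numerical conditioning of the two moments.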

6. CONCLUSION & FUTURE WORKS

This paper takes an important step towards designing non-linear secure aggregation protocols in the federated setting. Specifically, we propose a universal secure protocol to evaluate frequency moments in the federated setting. We focus on an important application of the protocol: the correlation test. We give a formal security proof and utility analysis of our proposed protocol and validate them with empirical evaluations and downstream use case studies. We also discuss a potential future direction: we deem it promising to provide stronger privacy guarantees for FED-COR by incorporating differential privacy techniques, such as differentially private frequency moments estimation (Wang et al., 2021) or adding calibrated discrete Gaussian noise (Canonne et al., 2020) to the users' local updates.

C SAFFRON PROCEDURE REFRESHER

In Sec. 4.2, we adopt the SAFFRON procedure (Ramdas et al., 2018) to perform online FDR control. SAFFRON is currently the state of the art for online multiple hypothesis testing. In Alg. 5, we formally present the SAFFRON algorithm. The initial error budget for SAFFRON is (1 − λ₁)W₀ < (1 − λ₁)α, and this budget is allocated to different tests over time. The sequence {λ_j}_{j=1}^∞ is defined by g_t, and λ_j serves as a weak estimate of α_j; g_t can be any coordinate-wise non-decreasing function (line 8 in Alg. 5). R_j := 1(p_j < α_j) is the indicator for rejection, while C_j := 1(p_j < λ_j) is the indicator for candidacy, and τ_j is the j-th rejection time. For each p_t, if p_t < λ_t, SAFFRON adds it to the candidate set C_t and updates the candidate counts after the j-th rejection (lines 9-10 in Alg. 5). Then α_t is updated from several quantities, such as the current wealth, the current total number of rejections, and the current size of the candidate set (lines 11-14 in Alg. 5), and the decision R_t is made according to the updated α_t (line 15 in Alg. 5). The hyper-parameters for the SAFFRON procedure in the online false discovery rate control of Sec. 4 are aligned with the setting in Ramdas et al. (2018): the target FDR level is α = 0.05, the initial wealth is W₀ = 0.0125, and γ_j is computed as γ_j = (1/(j+1)^{1.6}) / (\sum_{j'=0}^{10000} 1/(j'+1)^{1.6}).
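As a small sketch of these hyper-parameters, the following computes the discount sequence γ_j and checks that it is a valid spending sequence (non-negative, non-increasing, summing to one). This covers only the configuration, not Alg. 5 itself:

```python
# SAFFRON discount sequence as configured in the experiments:
# gamma_j ∝ 1/(j+1)^1.6, normalized over j = 0..10000.
N = 10_000
raw = [1.0 / (j + 1) ** 1.6 for j in range(N + 1)]
Z = sum(raw)
gamma = [g / Z for g in raw]

alpha = 0.05   # target FDR level
W0 = 0.0125    # initial wealth, strictly below alpha

# A valid spending sequence never over-spends the wealth W0.
print(gamma[0], sum(gamma))
```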

D PROOF FOR THEOREM 1

Proof for Theorem 1. To prove Theorem 1, we need the following lemma.

Lemma 1 (Security of the secure aggregation protocol). Let SECUREAGG be the secure aggregation protocol in Alg. 4 of Appendix B, instantiated with cryptographic security parameter λ. There exists a probabilistic polynomial-time (PPT) simulator SIM_SA such that, for all clients C, all numbers of clients n, and all aggregated encodings e, the output of SIM_SA is indistinguishable from the view of the real server, i.e., SECUREAGG_C ≈_λ SIM_SA(e, n).

Lemma 1 is derived from the security analysis of our employed secure aggregation protocol (Theorem 3.6 in Bell et al. (2020)), which establishes that the protocol securely conceals the individual information in the aggregated result. With this lemma, we prove the theorem for the federated correlation test by presenting a sequence of hybrids that begins with the real protocol execution and ends with the simulated protocol execution. We demonstrate that every two consecutive hybrids are indistinguishable, so by transitivity all the hybrids are indistinguishable.

HYB 1: This is the view of the server in the real protocol execution, REAL_C.

HYB 2: In this hybrid, we replace the view during the execution of each SECUREAGG({v_x^(i)}_{i∈[n]}) in line 3 of Alg. 2 with the output of SIM_SA(v_x, n), one by one. According to Lemma 1, each replacement preserves indistinguishability. Hence, HYB 2 is indistinguishable from HYB 1.

Theorem 4 (Computation Cost). Let Π be an instantiation of Alg. 2 with the secure aggregation protocol from Bell et al. (2020); then (1) the client-side computation cost is O(m_x log n + m_y log n + ℓ log n + mℓ); (2) the server-side computation cost is O(m_x + m_y + ℓ).

Proof sketch for Theorem 4. The client computation can be broken up as: k key agreements (O(k) complexity, line 9 in Alg. 4); generating masks m_{i,j} for all neighbors c_j (O(k(m_x + m_y + ℓ)) complexity, lines 3, 4, 15 in Alg. 2 and line 10 in Alg. 4); sampling the encoding matrix P (O(mℓ) cost, line 13 in Alg. 2); and computing the encoding (O(mℓ) cost, line 14 in Alg. 2). Since k = O(log n), the client computation cost is O(m_x log n + m_y log n + ℓ log n + mℓ). The server-side cost follows directly from the semi-honest computation analysis in Bell et al. (2020); the extra O(ℓ) term is the complexity of the geometric mean estimator.

G FURTHER SECURITY ANALYSIS WHEN COLLUSION HAPPENS

We have shown that Alg. 2 provides a strong security guarantee when there is no collusion between the clients and the server: the server only learns the non-private marginal distributions of the contingency table and the final aggregated results. In this section, we analyze the leakage of Alg. 2 when collusion happens, to demonstrate that FED-χ² still provides a strong privacy guarantee and to help the reader better understand our protocol.

Remark: what does Alg. 2 leak when the server colludes with a client? If the server colludes with one client, then it knows the random seed r (line 12 of Alg. 2) used to generate the projection matrix P. In the following, we analyze the leakage of client private data when the server knows P. By Theorem 1, the individual updates of the clients are perfectly hidden in the aggregated results, and FED-χ² leaks no more than the linear equation system

P v = e,    J_{1,m_y} V^T = v_x^T,    J_{1,m_x} V = v_y^T,    (15)

where J_{1,m_x} and J_{1,m_y} are 1 × m_x and 1 × m_y all-ones matrices, V is an m_x × m_y matrix whose elements are {v_xy}, and v is the flattened vector of V. In (15), v (or V) is sensitive, and all the other matrices and vectors are already known to the server. Note also that, due to the requirements of secure aggregation, all the values in (15) are discretized into a finite field, so the server can solve the system (15) over the finite field to obtain information about v. The following proposition establishes an important fact: the equation system has a large solution space, which conceals the real joint distribution.

Proposition 1. Given a projection matrix P ∈ Z_q^{ℓ×m}, v_x ∈ Z_q^{m_x}, v_y ∈ Z_q^{m_y}, and e ∈ Z_q^ℓ, if m > ℓ + m_x + m_y, then there are at least q^{m−ℓ−m_x−m_y} solutions to the system of equations (15).

Proof sketch for Proposition 1. The system of linear equations over Z_q contains m_x + m_y + ℓ equations and m variables. Given m > m_x + m_y + ℓ, the rank of the coefficient matrix is at most m_x + m_y + ℓ. By the Rouché-Capelli theorem (Brunetti & Renato, 2014) over finite fields, the solution set forms an at least (m − m_x − m_y − ℓ)-dimensional translate of a subspace of Z_q^m. As a result, the solution space contains at least q^{m−ℓ−m_x−m_y} solution vectors.

Proposition 1 shows an important fact: the joint distribution is hidden in a subspace containing exponentially many possible distributions. Although collusion between a client and the server is unlikely in cross-silo federated settings (consider our example in Sec. 3.1) and is thus not considered in our threat model, we have shown that Alg. 2 practically enforces privacy given the considerably large size of the solution space.
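A brute-force toy check of Proposition 1 over Z_2, with an arbitrary projection row and an arbitrary table: it enumerates every candidate table consistent with what a colluding server observes and confirms the promised lower bound on the solution count.

```python
from itertools import product

# Toy parameters: a 3x3 contingency table (m = 9, m_x = m_y = 3),
# one projection row (l = 1), field Z_q with q = 2, so the proposition
# promises at least q^(m - l - m_x - m_y) = 2^2 = 4 consistent tables.
# P and v_true are arbitrary illustrative choices.
q, mx, my = 2, 3, 3
m, l = mx * my, 1
P = [[1, 0, 1, 1, 0, 0, 1, 1, 0]]      # l x m projection matrix over Z_q
v_true = [1, 0, 1, 0, 1, 1, 0, 0, 1]   # flattened true table

def observables(v):
    """What the colluding server sees: e = P v, plus the row and column
    marginals of the table V (all arithmetic mod q)."""
    e = [sum(p * x for p, x in zip(row, v)) % q for row in P]
    rows = [sum(v[i * my:(i + 1) * my]) % q for i in range(mx)]
    cols = [sum(v[j::my]) % q for j in range(my)]
    return e, rows, cols

target = observables(v_true)
solutions = [v for v in product(range(q), repeat=m)
             if observables(list(v)) == target]
print(len(solutions))  # at least q^(m - l - mx - my) = 4 candidate tables
```

Because the marginal constraints are themselves linearly dependent (row sums and column sums share the same total), the actual count can exceed the bound.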

H DETAILS OF DATASETS

The details of the real-world datasets used in Sec. 4.1 are provided in Table 1. The license of Credit Risk Classification (Govindaraj, Praveen) is CC BY-SA 4.0, and the license of German Traffic Sign (Houben et al., 2013) is CC0: Public Domain. The other datasets, which carry no explicit license, are from the UCI Machine Learning Repository (Dua & Graff, 2017). The random seed and all other settings are also kept the same to make the comparison fair. We obtain the results in Fig. 3; the models are all trained on an NVIDIA GeForce RTX 3090.

J FURTHER RESULTS ON FED-χ² WITH DROPOUTS

We present the results with 10%, 15%, and 20% client dropouts in Fig. 5. The results further show that FED-χ² can tolerate a considerable portion of client dropouts in Round 2 of Alg. 2.

K FURTHER RESULTS FOR ONLINE FDR CONTROL

In this section, we provide further results for online FDR control. As shown in Fig. 4, FED-χ² achieves good performance when the encoding size ℓ is larger than 200. In addition, we provide statistics for each evaluated encoding size ℓ in Table 3. These results demonstrate that FED-χ² performs well and is comparable to the centralized χ²-test as the encoding size ℓ increases.

L FURTHER RESULTS FOR FEATURE SELECTION

Our results in Sec. 4.2, paragraph Feature Selection, demonstrate that FED-χ² performs well with encoding size ℓ = 50. We conduct experiments with different encoding sizes ℓ to further assess their effect on FED-χ²'s performance. In Fig. 6, we present the effect of the encoding size ℓ on the ratio of commonly-selected features between the original centralized χ²-test and FED-χ². A larger ratio of commonly-selected features means that FED-χ² behaves more closely to the original centralized χ²-test; if the ratio is 1, the two algorithms select identical features. The results in Fig. 6 show that as the encoding size ℓ increases, the performance of FED-χ² approaches that of the original centralized χ²-test. Similarly to Sec. 4.2, we evaluate FED-χ²'s performance under different encoding sizes ℓ by training the model with the features selected by FED-χ². Fig. 7 shows the results. When trained with FED-χ²-selected features, the model can achieve comparable accuracy to the model trained with features



A contingency table contains the frequency distribution of the variables; see (Wikipedia, 2021).



and broadcast v, {v_x}, and {v_y} to all the clients.

Round 2: Approximate the statistics
Client i ∈ [n]:
  Calculate \bar{v}_{xy} = v_x v_y / v
  Prepare u_i such that u_i(x, y) = (v_{xy}^{(i)} − \bar{v}_{xy}/n) / \sqrt{\bar{v}_{xy}}
  Randomly sample a random seed r_i and broadcast it to all the other clients
  Collect the random seeds from the other clients and obtain the shared random seed r = \sum_i r_i
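A minimal sketch of this round, assuming Gaussian (2-stable) projections and a plain mean-of-squares estimator in place of the paper's geometric mean estimator; the seed, sizes, and residual vectors u_i below are all illustrative:

```python
import numpy as np

rng_seed = 1234                       # stands in for the shared seed r
m, n_clients, enc_len = 64, 5, 400    # table size, clients, encoding size ℓ

# Hypothetical local residual vectors u_i (flattened tables); in FED-χ²
# they would be (v_xy^(i) - vbar_xy/n) / sqrt(vbar_xy).
rng = np.random.default_rng(0)
u = rng.normal(size=(n_clients, m)) * 0.1

# Every client derives the same projection matrix P from the shared seed,
# here with 2-stable (Gaussian) entries scaled so that the second frequency
# moment of sum_i u_i is preserved in expectation.
P = np.random.default_rng(rng_seed).normal(size=(enc_len, m)) / np.sqrt(enc_len)

encodings = [P @ u_i for u_i in u]    # client-side encodings e_i
e = sum(encodings)                    # what secure aggregation reveals

# Linearity: aggregating encodings equals encoding the aggregate.
assert np.allclose(e, P @ u.sum(axis=0))

f2_true = float(np.sum(u.sum(axis=0) ** 2))
f2_est = float(np.sum(e ** 2))        # simple F2 estimate from e alone
print(f2_true, f2_est)
```

The linearity assertion is the property that makes the encodings compatible with secure aggregation: the server only ever sees e = P (Σ_i u_i).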

Theorem 3 (Communication Cost). Let Π be an instantiation of Alg. 2 with the secure aggregation protocol in Alg. 4 of Appendix B; then (1) the client-side communication cost is O(log n + m_x + m_y + ℓ); (2) the server-side communication cost is O(n log n + n m_x + n m_y + nℓ).

Theorem 4 (Computation Cost). Let Π be an instantiation of Alg. 2 with the secure aggregation protocol in Alg. 4 of Appendix B; then (1) the client-side computation cost is O(log² n + (ℓ + m_x + m_y) log n + mℓ); (2) the server-side computation cost is O(n log² n + n(ℓ + m_x + m_y) log n + ℓ).

Note that, compared with the original computation cost presented in Bell et al. (2020), the client-side overhead has an extra O(mℓ) term. This term is incurred by the encoding overhead. We also give an empirical evaluation of the client-side computation overhead in Sec. 4.1. Please refer to Appendix F for the detailed proofs of Theorem 3 and Theorem 4.

Figure 1: Multiplicative error and power of FED-χ 2 w.r.t. encoding size ℓ with and without dropout.

Figure 2: Client-side encoding overhead.

Figure 3: Accuracy of models trained with features selected by FED-χ 2 and centralized χ 2 -test.

Figure 4: FDR & TDR w.r.t. ℓ for SAFFRON.

Definition 4 (HARARY(n, k) Graph). Let HARARY(n, k) denote a graph with n nodes of degree k. The graph has vertices V = [n] and an edge between two distinct vertices i and j if and only if (j − i) mod n ≤ ⌊(k + 1)/2⌋ or (j − i) mod n ≥ n − ⌊k/2⌋.
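A minimal sketch of this definition, for even k (where the edge condition is symmetric as stated); n and k below are illustrative:

```python
# HARARY(n, k) from Definition 4: vertices [n] with an edge {i, j} iff
# (j - i) mod n <= floor((k+1)/2) or (j - i) mod n >= n - floor(k/2).

def harary_neighbors(i, n, k):
    """Neighbors of vertex i in HARARY(n, k)."""
    return {j for j in range(n) if j != i
            and ((j - i) % n <= (k + 1) // 2 or (j - i) % n >= n - k // 2)}

n, k = 12, 4   # illustrative sizes; in Alg. 4, k = O(log n)
graph = {i: harary_neighbors(i, n, k) for i in range(n)}

# Every vertex has degree exactly k and adjacency is symmetric, as
# required for the pairwise-masking topology of secure aggregation.
print(sorted(len(v) for v in graph.values()))
```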

Dataset details. The details of the regression models trained for feature selection in Sec. 4.2 are reported in Table 2. The training and testing splits are the same for FED-χ², the centralized χ²-test, and the model without feature selection (i.e., there are 17,262 training and 4,316 test documents). We use the same learning rate.

Detailed results for online FDR control.

APPENDIX A STABLE DISTRIBUTION REFRESHER

A non-degenerate distribution is said to be stable if, for X and Y sampled independently from the distribution, aX + bY for any constants a, b > 0 has the same distribution up to location and scale parameters. Paul Lévy first systematically studied the stable distribution family in his masterpiece Calcul des probabilités (Lévy, 1925), so stable distributions are also referred to as Lévy α-stable distributions. Stable distributions are parameterized by the location μ, the scale F, the stability parameter α, and the skewness β. When α ≠ 1, the characteristic function is

φ(t) = exp( itμ − |Ft|^α (1 − iβ sgn(t) tan(πα/2)) ).

When α = 1, the characteristic function is

φ(t) = exp( itμ − F|t| (1 + iβ (2/π) sgn(t) log|t|) ).

In the main text, we only consider the subset of stable distributions with μ = 0 and α ≠ 1. The stable distribution family contains many familiar distributions: the 1-stable distribution is the Cauchy distribution, the 2-stable distribution is the Gaussian distribution, and the 1/2-stable distribution is known as the Lévy distribution. Stable distributions also have discrete analogues defined by their probability generating function

G(s) = exp( −F (1 − s)^α ),

where F is the scale and α is the stability parameter. However, for discrete stable distributions, the support of α is (0, 1] instead of (0, 2].

B SECURE AGGREGATION REFRESHER

In the second step, the edges of the graph determine pairs of clients, each of which runs a key agreement protocol to share random keys; the random keys will be used by each party to derive a mask for her input and to handle dropouts. In the third step, each client c_i, i ∈ A_1, sends secret shares to its neighbors. In the fourth step, the server checks whether the client dropout exceeds the threshold δ, and lets the clients know which of their neighbors did not drop out. In the fifth step, each pair (i, j) of connected clients in G runs a λ-secure key agreement protocol s_{i,j} = KA.Agree(sk_i^1, pk_j^1), which uses the keys exchanged in the previous step to derive a shared random key s_{i,j}. The pairwise masks m_{i,j} = F(s_{i,j}) can then be computed, where F is a pseudorandom generator (PRG). If the semi-honest server announces dropouts and some masked inputs of the claimed dropouts arrive later, the server could recover those inputs. To prevent this, another level of masks, called self masks, r_i is added to the input. Thus, the masked input of client c_i is

y_i = x_i + r_i + \sum_{j ∈ N_G(i), i<j} m_{i,j} − \sum_{j ∈ N_G(i), i>j} m_{i,j}.

Steps 6-8 deal with client dropouts by recovering the self masks r_i of the clients who are still active and the pairwise masks m_{i,j} of the clients who have dropped out. Finally, the pairwise masks cancel in pairs, and the server subtracts the recovered self masks from the final sum, leaving \sum_i x_i.

HYB 3: Similar to HYB 2, we replace the view during the execution of each SECUREAGG({v_y^(i)}_{i∈[n]}) in line 4 of Alg. 2 with the output of SIM_SA(v_y, n), one by one. According to Lemma 1, HYB 3 is indistinguishable from HYB 2. HYB 4: In this hybrid, we replace the view during the execution of SECUREAGG({e_i}_{i∈[n]}) in line 15 of Alg. 2 with SIM_SA(e, n). This hybrid is the output of SIM. According to Lemma 1, HYB 4 is indistinguishable from HYB 3.

Algorithm 4 SECUREAGG: Secure Aggregation Protocol (Algorithm 2 from Bell et al. (2020)).
▷ We denote by A_1, A_2, A_3 the sets of clients that reach certain points without dropping out: A_1 consists of the clients who finish step (3), A_2 those who finish step (5), and A_3 those who finish step (7). For each A_i, A'_i is the set of clients for which the server sees that they have completed that step on time.
(1) The server generates G, a regular degree-k undirected graph with n nodes. By N_G(i) we denote the set of k nodes adjacent to c_i (its neighbors).
(2) Client c_i, i ∈ [n], generates key pairs (sk_i^1, pk_i^1), (sk_i^2, pk_i^2) and sends (pk_i^1, pk_i^2) to the server, who forwards the message to N_G(i).
(3) Client c_i:
  • Computes two sets of shares;
  • Sends to the server a message m = (j, c_{i,j}), where c_{i,j} = E_auth.Enc(k_{i,j}, (i||j||h_{i,j}^b||h_{i,j}^s)) and k_{i,j} = KA.Agree(sk_i^2, pk_j^2), for each j ∈ N_G(i).
(4) The server aborts if |A'_1| < (1 − δ)n, and otherwise forwards (j, c_{i,j}) to client c_j.
(5) Client c_i:
  • Computes a shared random PRG seed s_{i,j} as s_{i,j} = KA.Agree(sk_i^1, pk_j^1);
  • Computes masks m_{i,j} = F(s_{i,j}) and r_i = F(b_i);
  • Sends to the server their masked input
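As a toy illustration of the masking mechanism above (pairwise masks cancelling in the sum, self masks subtracted by the server), the following sketch replaces key agreement and the PRG with seeded pseudo-randomness, uses a complete graph, and assumes no dropouts; all names and parameters are illustrative:

```python
import random

# Toy sketch of Algorithm 4's masking (no dropouts, no real crypto):
# each pair (i, j) shares a seed, from which both derive the same pairwise
# mask; client i adds it, client j subtracts it, so the masks cancel in
# the sum. Self masks r_i are removed by the server after being revealed.
n, q = 5, 2**16
inputs = [7, 13, 2, 40, 11]

pair_mask = {(i, j): random.Random(f"{i},{j}").randrange(q)
             for i in range(n) for j in range(i + 1, n)}
self_mask = [random.Random(f"self{i}").randrange(q) for i in range(n)]

def masked_input(i):
    y = (inputs[i] + self_mask[i]) % q
    for j in range(n):
        if i < j:
            y = (y + pair_mask[(i, j)]) % q   # lower index adds the mask
        elif j < i:
            y = (y - pair_mask[(j, i)]) % q   # higher index subtracts it
    return y

total = sum(masked_input(i) for i in range(n)) % q
total = (total - sum(self_mask)) % q          # server removes self masks
print(total)  # equals sum(inputs) mod q
```

Each individual masked_input(i) is uniformly distributed given the masks, yet the aggregate is exact.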

E PROOF FOR UTILITY

Proof for Theorem 2. First, we introduce the following lemma from Li (2008).

Lemma 2 (Tail bounds of the geometric mean estimator (Li, 2008)). The geometric mean estimator admits both a right tail bound and a left tail bound, with constants depending on γ_e, Euler's constant.
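To illustrate the estimator that Lemma 2 analyzes, the sketch below applies a plain geometric mean to Cauchy (1-stable) projections to estimate the first frequency moment F_1; Li (2008)'s estimator additionally includes a finite-sample bias-correction constant that we omit here, and the vector u and the sizes are arbitrary:

```python
import numpy as np

# Geometric mean estimator for F_1 = sum_i |u_i| from 1-stable (Cauchy)
# projections: a combination sum_i p_i u_i of i.i.d. standard Cauchy p_i
# is Cauchy with scale sum_i |u_i|, and E[log |standard Cauchy|] = 0, so
# the geometric mean of |e_j| converges to F_1 as ℓ grows.
rng = np.random.default_rng(7)
m, ell = 100, 4000                         # illustrative sizes

u = rng.uniform(0.0, 1.0, size=m)          # hypothetical frequency vector
P = rng.standard_cauchy(size=(ell, m))     # shared 1-stable projection
e = P @ u                                  # compact encoding of u

f1_true = float(np.sum(np.abs(u)))
f1_est = float(np.exp(np.mean(np.log(np.abs(e)))))  # geometric mean
print(f1_true, f1_est)
```

With ℓ = 4000 projections, the relative error is typically a few percent; the tail bounds of Lemma 2 quantify exactly how quickly this concentration happens.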

F PROOF FOR COMMUNICATION & COMPUTATION COST

In this section, we prove Theorem 3 and Theorem 4. Each client exchanges key-agreement messages and secret shares with its k = O(log n) neighbors and sends its m_x-, m_y-, and ℓ-dimensional masked vectors, giving a client-side communication cost of O(log n + m_x + m_y + ℓ). The server receives from or sends to each client O(log n + m_x + m_y + ℓ) messages, so the server communication cost is O(n log n + n m_x + n m_y + nℓ).

Algorithm 5 SAFFRON Procedure.

selected by the original centralized χ²-test. Also, consistent with the results in Fig. 3 in Sec. 4.2, we see that when the encoding size ℓ ≥ 25, models trained with FED-χ²-selected features achieve higher accuracy than the models without feature selection. These results further demonstrate the effectiveness of FED-χ².

M INFLUENCE OF FINITE FIELD SIZE

As shown in Fig. 8, we test the performance of FED-χ² under different finite field sizes q. We observe that for q ∈ {2^16, 2^32, 2^64}, there is almost no difference in performance. The result shows that FED-χ² is numerically stable.

Figure 6: Ratio of commonly-selected features between FED-χ² and the original centralized χ²-test.
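A minimal sketch of the discretization step this experiment varies: real-valued updates are fixed-point encoded into Z_q, aggregated modulo q, and decoded by the server; the scale and field size here are illustrative choices, not the paper's settings:

```python
# Fixed-point discretization of real-valued client updates into Z_q for
# secure aggregation, followed by exact recovery of the (signed) sum.
q = 2**32
scale = 2**16          # fixed-point precision (illustrative)

def encode(x):
    """Map a (possibly negative) real to Z_q by fixed-point rounding."""
    return round(x * scale) % q

def decode(z):
    """Map back, interpreting values above q/2 as negative."""
    signed = z - q if z > q // 2 else z
    return signed / scale

updates = [0.25, -1.5, 3.125, -0.0625]   # hypothetical client updates
agg = sum(encode(x) for x in updates) % q
print(decode(agg))  # 1.8125, i.e. sum(updates)
```

As long as the true sum stays well below q/(2·scale), the modular wrap-around never triggers, which is why the choice of q barely affects accuracy in Fig. 8.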

