FED-COR: FEDERATED CORRELATION TEST WITH SECURE AGGREGATION

Abstract

In this paper, we propose the first federated correlation test framework compatible with secure aggregation, namely FED-COR. In FED-COR, correlation tests are recast as frequency moment estimation problems. To estimate the frequency moments, the clients collaboratively generate a shared projection matrix and then use stable projection to encode the local information in a compact vector. As such encodings can be linearly aggregated, secure aggregation can be applied to conceal the individual updates. We formally establish the security guarantee of FED-COR by proving that only the minimum necessary information (i.e., the correlation statistics) is revealed to the server. The evaluation results show that FED-COR achieves good accuracy with small client-side computation overhead and performs comparably to the centralized correlation test in several real-world case studies.

1. INTRODUCTION

A correlation test, as the name implies, examines the correlation between two random variables using observational data. It is a fundamental building block in a wide variety of real-world applications, including feature selection (Zheng et al., 2004), cryptanalysis (Nyberg, 2001), causal graph discovery (Spirtes et al., 2000), empirical finance (Ledoit & Wolf, 2008; Kim & Ji, 2015), medical studies (Kassirer, 1983), and genomics (Wilson et al., 1999; Dudoit et al., 2003). Because the observational data used in correlation tests may contain sensitive information such as genomic information, collecting participants' information in a central repository poses a significant privacy risk. To address this problem, we adopt the federated setting, where each client maintains its own data and communicates with a central server to compute a function. The communication transcript should contain as little information as feasible to prevent the server from inferring sensitive information.

To motivate our work and ease the understanding of the problem setting, consider a medical company that wants to study the correlation between genetic defects and race using patients' private data from several hospitals. In a traditional federated approach, the server, which is the medical company, aggregates the hospitals' local private contingency tables (see footnote 0) using secure aggregation (Bonawitz et al., 2017; Bell et al., 2020). The company can then conduct correlation tests on the aggregated global contingency table without directly accessing the individual hospitals' private data. However, this method leaks the joint distribution, that is, the whole global contingency table, to the server. The joint distribution may contain sensitive information, and leaking it is likely to violate privacy regulations.
For instance, the medical company can observe the genetic distribution across races from the global table. Secure aggregation primarily supports linear aggregation. However, the computation in correlation tests involves a summed p-th moment over the aggregated data, where p ∈ (0, 1) ∪ (1, 2]. Thus, the joint distribution will be leaked if we directly apply secure aggregation.

To bridge the gap between secure aggregation and federated correlation tests, we take an important step towards designing non-linear secure aggregation protocols. Specifically, we design a federated protocol framework, namely FED-COR, optimized for a class of correlation tests, such as the χ²-test and the G-test. FED-COR is designed to have low computation and communication costs and to disclose only information that is much less sensitive than the joint distribution.

Our first insight is to recast correlation tests as frequency moment estimation problems. To approximate the frequency moments in a federated manner, each client collaborates with the other clients to generate a projection matrix and encodes its raw data into a low-dimensional vector via stable random projection (Indyk, 2006; Vempala, 2005; Li, 2008). Such encodings can be aggregated by summation alone, allowing clients to leverage secure aggregation to aggregate the encodings. The server then decodes the aggregated encoding to approximate the frequency moments. As secure aggregation conceals each client's individual update within the aggregated global update, the server learns only the information necessary for the correlation test.

To illustrate the power of FED-COR, we instantiate it with a representative correlation test, namely Pearson's χ²-test (Pearson, 1900), and refer to the concrete protocol as FED-χ². We evaluate FED-χ² on 4 synthetic datasets and 16 real-world datasets. The results show that FED-χ² can replace centralized correlation tests with good accuracy.
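The encode, aggregate, and decode pipeline described above can be sketched for the χ² case as follows. This is a minimal illustration under stated assumptions, not the FED-χ² protocol itself: the function names (`client_encode`, `server_decode`, `run_demo`) are hypothetical, the marginals (and hence the expected counts) are assumed to be already known to every client, the projection matrix is drawn from a single shared random generator rather than generated collaboratively, and the decoder is a plain arithmetic-mean estimator for the second moment, chosen for simplicity. The summation over clients stands in for secure aggregation.

```python
import numpy as np


def chi2_exact(table):
    """Centralized Pearson chi-square statistic of a contingency table."""
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum())


def client_encode(local_table, expected, n_clients, proj):
    """Project this client's additive share of u = (O - E) / sqrt(E).

    Assumption: the expected counts E are available to the client.
    """
    u_local = (local_table - expected / n_clients) / np.sqrt(expected)
    return proj @ u_local.ravel()  # compact vector of length ell


def server_decode(aggregated):
    """Estimate ||u||_2^2 (the chi-square statistic) from the summed encoding.

    Each coordinate of P @ u is N(0, ||u||^2) for a standard Gaussian P,
    so the mean of the squared coordinates is an unbiased estimate.
    """
    return float(np.mean(aggregated ** 2))


def run_demo(seed=0, n_clients=10, rows=50, cols=40, ell=400):
    rng = np.random.default_rng(seed)
    # Synthetic joint distribution and per-client contingency tables.
    probs = rng.dirichlet(np.ones(rows * cols))
    local_tables = [
        rng.multinomial(20000, probs).reshape(rows, cols).astype(float)
        for _ in range(n_clients)
    ]
    global_table = sum(local_tables)
    n = global_table.sum()
    expected = np.outer(global_table.sum(axis=1), global_table.sum(axis=0)) / n

    # Shared Gaussian (2-stable) projection matrix with ell << m = rows * cols.
    proj = rng.standard_normal((ell, rows * cols))

    encodings = [client_encode(t, expected, n_clients, proj) for t in local_tables]
    aggregated = np.sum(encodings, axis=0)  # linear step; stand-in for secure aggregation

    exact = chi2_exact(global_table)
    estimate = server_decode(aggregated)
    return exact, estimate, aggregated, proj, global_table, expected


if __name__ == "__main__":
    exact, estimate, *_ = run_demo()
    print(f"exact chi2 = {exact:.1f}, projected estimate = {estimate:.1f}")
```

Because the encoding is linear, the sum of the clients' length-ℓ vectors equals the projection of the global vector, so the server only ever needs the sum. In this sketch the relative standard deviation of the estimate is about sqrt(2/ℓ), which makes the trade-off between encoding size and accuracy explicit.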
Compared to the traditional method with secure aggregation described above, FED-χ² saves a factor of O(m) in communication cost per client, where m is the size of the contingency tables. In FED-χ², clients only need to upload a low-dimensional encoding of size ℓ ≪ m, whereas in the traditional method the clients upload the complete contingency tables. Additionally, we analyze FED-χ² in two real-world use cases: feature selection and online false discovery rate control. The results show that FED-χ² achieves performance comparable to centralized correlation tests and can withstand up to 20% of clients dropping out with only a minor influence on accuracy. Besides Pearson's χ²-test, we also demonstrate how to accommodate other commonly used correlation tests, such as the G-test, in FED-COR.

In summary, we make the following contributions:

• We propose FED-COR, the first secure federated correlation test framework. FED-COR is computation- and communication-efficient and leaks much less information than directly using secure aggregation to collect the contingency table, which completely leaks the joint distribution.
• FED-COR decomposes a correlation test into frequency moment estimation that can easily be encoded and decoded using stable projection and secure aggregation techniques. We provide a formal security proof and a utility analysis of the protocol.
• We demonstrate how to accommodate the χ²-test and the G-test in FED-COR, and empirically evaluate FED-χ² in several real-world use cases. The findings suggest that FED-χ² can substitute for the centralized χ²-test with comparable accuracy. In addition, FED-χ² can tolerate up to 20% client dropout with a minor accuracy drop. We provide the code in the supplementary material for verification of the results.
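To make concrete why linearly aggregable encodings can be hidden from the server, the following sketch shows a simplified pairwise-masking scheme in the spirit of secure aggregation. It is an illustration under strong assumptions, not the protocol of Bonawitz et al. (2017): all names are hypothetical, the pairwise masks come from one local random generator instead of pairwise key agreement, the fixed-point parameters `MODULUS` and `SCALE` are arbitrary choices, and dropout handling is omitted entirely.

```python
import numpy as np

MODULUS = 2 ** 32  # arithmetic is done modulo this value (assumed parameter)
SCALE = 2 ** 16    # fixed-point scaling for real-valued encodings (assumed parameter)


def quantize(vec):
    """Map a real vector to fixed-point integers in [0, MODULUS)."""
    return np.round(np.asarray(vec) * SCALE).astype(np.int64) % MODULUS


def dequantize(vec):
    """Map integers in [0, MODULUS) back to signed real values."""
    signed = np.where(vec >= MODULUS // 2, vec - MODULUS, vec)
    return signed / SCALE


def pairwise_masks(n_clients, dim, seed):
    """For each pair (i, j), client i adds a random vector and client j subtracts it.

    Assumption: a single generator plays the role of the pairwise shared secrets.
    """
    rng = np.random.default_rng(seed)
    masks = np.zeros((n_clients, dim), dtype=np.int64)
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = rng.integers(0, MODULUS, size=dim, dtype=np.int64)
            masks[i] = (masks[i] + shared) % MODULUS
            masks[j] = (masks[j] - shared) % MODULUS
    return masks


def masked_updates(encodings, seed=0):
    """What each client would upload: its quantized encoding plus its mask."""
    q = np.stack([quantize(e) for e in encodings])
    return (q + pairwise_masks(*q.shape, seed)) % MODULUS


def aggregate(masked):
    """Server side: the masks cancel in the sum, leaving only the total."""
    return dequantize(masked.sum(axis=0) % MODULUS)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    encodings = [rng.uniform(-10, 10, size=8) for _ in range(5)]
    total = aggregate(masked_updates(encodings, seed=7))
    print(np.max(np.abs(total - np.sum(encodings, axis=0))))
```

Each uploaded vector looks uniformly random on its own, while the sum recovers the aggregate up to quantization error. This only works because the server's target is a sum, which is why the encodings must be linearly aggregable.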

2. RELATED WORK

There has been a line of work studying secure federated learning and statistics. Bonawitz et al. (2017) proposed the widely cited secure aggregation protocol as a low-cost way to securely compute linear functions in a federated setting. It has seen many variants and improvements since then. For instance, Truex et al. (2019) and Xu et al. (2019) employed advanced cryptographic tools for secure aggregation, such as threshold homomorphic encryption and functional encryption. So et al. (2021) proposed TURBOAGG, which combines secure sharing with erasure codes for better dropout tolerance. To improve communication efficiency, Bell et al. (2020) and Choi et al. (2020) replaced the complete graph in secure aggregation with either a sparse random graph or a low-degree graph.

Secure aggregation is deployed in a variety of applications. Agarwal et al. (2018) added binomial noise to local gradients, resulting in both differential privacy and communication efficiency. Wang et al. (2020) replaced the binomial noise with discrete Gaussian noise, which is shown to exhibit better composability. Kairouz et al. (2021) proved that the sum of discrete Gaussians is close to a discrete Gaussian, thus discarding the common random seed assumption of Wang et al. (2020). The above three works all incorporate secure aggregation in their protocols to lower the noise scale required for differential privacy. Chen et al. (2020) added an extra public parameter to each client to force them to train in the same way, allowing for the detection of malicious clients during aggregation. Nevertheless, the design of secure federated correlation tests, despite its importance in real-world scenarios, has not been explored by existing research in this field.

Footnote 0: A contingency table contains the frequency distribution of the variables; see (Wikipedia, 2021).

3. METHODOLOGY

In this section, we elaborate on the design of FED-COR. Sec. 3.1 formalizes the problem, establishes the notation system, and introduces the threat model. In Sec. 3.2, we detail the design of FED-COR

