FED-COR: FEDERATED CORRELATION TEST WITH SECURE AGGREGATION

Abstract

In this paper, we propose the first federated correlation test framework compatible with secure aggregation, namely FED-COR. In FED-COR, correlation tests are recast as frequency moment estimation problems. To estimate the frequency moments, the clients collaboratively generate a shared projection matrix and then use stable projection to encode the local information in a compact vector. As such encodings can be linearly aggregated, secure aggregation can be applied to conceal the individual updates. We formally establish the security guarantee of FED-COR by proving that only the minimum necessary information (i.e., the correlation statistics) is revealed to the server. The evaluation results show that FED-COR achieves good accuracy with small client-side computation overhead and performs comparably to the centralized correlation test in several real-world case studies.

1. INTRODUCTION

A correlation test, as the name implies, examines the correlation between two random variables using observational data. It is a fundamental building block in a wide variety of real-world applications, including feature selection (Zheng et al., 2004), cryptanalysis (Nyberg, 2001), causal graph discovery (Spirtes et al., 2000), empirical finance (Ledoit & Wolf, 2008; Kim & Ji, 2015), medical studies (Kassirer, 1983), and genomics (Wilson et al., 1999; Dudoit et al., 2003). Because the observational data used in correlation tests may contain sensitive information, such as genomic information, collecting participants' data in a central repository poses a significant privacy risk. To address this problem, we adopt the federated setting, where each client maintains its own data and communicates with a central server to compute a function. The communication transcript should contain as little information as feasible to prevent the server from inferring sensitive information.

To motivate our work and ease the understanding of the problem setting, we consider a medical company that wants to study the correlation between genetic defects and races using patients' private data from several hospitals. In a traditional federated approach, the server (here, the medical company) aggregates the hospitals' local private contingency tables¹ using secure aggregation (Bonawitz et al., 2017; Bell et al., 2020). The company can then conduct correlation tests on the aggregated global contingency table without directly accessing the individual hospitals' private data. Attentive readers might notice that this method leaks the joint distribution, that is, the whole global contingency table, to the server. The joint distribution may contain sensitive information, and leaking it will probably violate privacy regulations.
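The traditional baseline described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names `aggregate_tables` and `chi2_statistic` and the 2x2 tables are hypothetical, plain element-wise summation stands in for secure aggregation (no cryptographic masking is performed), and the Pearson χ² statistic is used as one example of a correlation test.

```python
def aggregate_tables(local_tables):
    """Sum the clients' local contingency tables element-wise.

    Assumption: plain summation stands in for secure aggregation, which
    would reveal only this sum (not the individual tables) to the server.
    """
    rows, cols = len(local_tables[0]), len(local_tables[0][0])
    return [[sum(t[i][j] for t in local_tables) for j in range(cols)]
            for i in range(rows)]


def chi2_statistic(table):
    """Pearson chi-square statistic of independence for a contingency table."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical local tables from three hospitals
# (rows: values of one variable, columns: values of the other).
local_tables = [
    [[10, 5], [3, 12]],
    [[8, 7], [6, 9]],
    [[12, 3], [1, 14]],
]

global_table = aggregate_tables(local_tables)  # linear step: a simple sum
stat = chi2_statistic(global_table)            # non-linear step on the sum
print(global_table, stat)
```

In this sketch the server-side function `chi2_statistic` takes the full global table as input, so whoever evaluates it sees the joint distribution.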
For instance, the medical company can observe the genetic distribution across races from the global table. Secure aggregation primarily supports linear aggregation. However, correlation tests involve computing a summed p-th moment over the aggregated data, where p ∈ (0, 1) ∪ (1, 2]. Thus, the joint distribution will be leaked if we directly apply secure aggregation.

To bridge the gap between secure aggregation and federated correlation tests, we take an important step towards designing non-linear secure aggregation protocols. Specifically, we design a federated protocol framework, namely FED-COR, optimized for a class of correlation tests, such as the χ²-test and the G-test. FED-COR is designed to have low computation and communication costs and to disclose only information that is much less sensitive than the joint distribution. Our first insight is to recast correlation tests as frequency moment estimation problems. To approximate the frequency moments



¹ A contingency table contains the frequency distribution of the variables; see (Wikipedia, 2021).

