DATA LEAKAGE IN TABULAR FEDERATED LEARNING

Abstract

While federated learning (FL) promises to preserve privacy in the distributed training of deep learning models, recent work in the image and NLP domains has shown that training updates leak the private data of participating clients. At the same time, most high-stakes applications of FL (e.g., legal and financial) use tabular data. Compared to the NLP and image domains, reconstruction of tabular data poses several unique challenges: (i) categorical features introduce a significantly more difficult mixed discrete-continuous optimization problem, (ii) the mix of categorical and continuous features causes high variance in the final reconstructions, and (iii) structured data makes it difficult for the adversary to judge reconstruction quality. In this work, we tackle these challenges and propose TabLeak, the first comprehensive reconstruction attack on tabular data. TabLeak is based on three key ingredients: (i) a softmax structural prior that implicitly converts the mixed discrete-continuous optimization problem into an easier, fully continuous one, (ii) a pooled ensembling scheme that exploits the structure of tabular data to reduce the variance of our reconstructions, and (iii) an entropy measure that successfully assesses reconstruction quality. Our experimental evaluation demonstrates the effectiveness of TabLeak, which achieves state-of-the-art results on four popular tabular datasets. For instance, on the Adult dataset, we improve attack accuracy by 10% over the baseline at the practically relevant batch size of 32, and we further obtain non-trivial reconstructions for batch sizes as large as 128. Our findings are important, as they show that FL on tabular data, which often carries high privacy risks, is highly vulnerable to data leakage.

1. INTRODUCTION

Federated learning (FL) (McMahan et al., 2016) has emerged as the most prominent approach to training machine learning models collaboratively without requiring the sensitive data of different parties to be sent to a single centralized location. While prior work has examined privacy leakage in federated learning in the context of computer vision (Zhu et al., 2019; Geiping et al., 2020; Yin et al., 2021) and natural language processing (Dimitrov et al., 2022a; Gupta et al., 2022; Deng et al., 2021), many applications of FL rely on large tabular datasets that include highly sensitive personal data such as financial information and health status (Borisov et al., 2021; Rieke et al., 2020; Long et al., 2021). However, no prior work has studied the issue of privacy leakage in the context of tabular data, a cause of concern for public institutions, which have recently launched a competition¹ with a 1.6 million USD prize to develop privacy-preserving FL solutions for fraud detection and infection risk prediction, both involving tabular data.

Key challenges Leakage attacks often rely on solving optimization problems whose solutions are the desired sensitive data points. Unlike other data types, tabular data poses unique challenges to solving these problems because: (i) the reconstruction is a solution to a mixed discrete-continuous optimization problem, in contrast to other domains where the problem is either fully continuous or fully discrete (pixels for images, tokens for text), (ii) the final reconstructions exhibit high variance because, uniquely to tabular data, discrete changes in the categorical features significantly change the optimization trajectory, and (iii) assessing the quality of reconstructions is harder than for images and text, e.g., determining whether a person with the reconstructed characteristics exists is difficult. Together, these challenges make it difficult to apply existing attacks to tabular data.
To address the first challenge of tabular data leakage, we transform the mixed discrete-continuous optimization problem into a fully continuous one by passing our current reconstructions z^t_1, ..., z^t_N through a per-feature softmax σ at every step t. Using the softmaxed data σ(z^t), we take a gradient step to minimize the reconstruction loss, which compares the received client update ∇f with a simulated client update computed on σ(z^t). In Step 2, we reduce the variance of the final reconstruction by pooling the N different solutions z_1, z_2, ..., z_N, thus tackling the second challenge. In Step 3, we address the challenge of assessing the fidelity of our reconstructions. We rely on the observation that when our proposed reconstructions z_1, z_2, ..., z_N agree, they often also match the true client data. We measure this agreement using entropy. In the example in Figure 1, the features sex and age produce low-entropy distributions; therefore, we assign high confidence to these results (green arrows). In contrast, the reconstruction of the feature race receives a low confidence rating (orange arrow); rightfully so, as the reconstruction is incorrect. We implemented our approach in an end-to-end attack called TabLeak and evaluated it on several tabular datasets. Our attack is highly effective: it obtains non-trivial reconstructions for batch sizes as large as 128, and on many practically relevant batch sizes, such as 32, it improves reconstruction accuracy by up to 10% compared to the baseline. Overall, our findings show that FL is highly vulnerable when applied to tabular data.
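To make the softmax relaxation of Step 1 concrete, the following is a minimal sketch on a toy linear model with a squared loss. The model, dimensions, and optimization details (finite-difference gradients, learning rate, iteration count) are our own illustrative assumptions, not the paper's actual implementation, which operates on neural network updates.

```python
import numpy as np

# Illustrative sketch of the softmax structural prior (Step 1): instead of
# searching over discrete one-hot vectors, we optimize continuous logits z
# and always map them onto the probability simplex with a softmax before
# simulating the client gradient. Toy setup, not the paper's implementation.

w = np.array([1.0, 2.0, 3.0])       # attacker-known weights of a linear model
x_true = np.array([0.0, 1.0, 0.0])  # private one-hot categorical feature
y = 0.0                             # private label

# The client computes the gradient of (w.x - y)^2 w.r.t. w and sends it.
g_true = 2.0 * (w @ x_true - y) * x_true

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def recon_loss(z):
    x = softmax(z)                               # continuous relaxation of x
    g_sim = 2.0 * (w @ x - y) * x                # simulated client gradient
    return float(np.sum((g_sim - g_true) ** 2))  # gradient-matching loss

z = np.zeros(3)  # unconstrained logits: the actual optimization variables
lr, eps = 0.5, 1e-5
for _ in range(2000):
    grad = np.zeros_like(z)  # finite-difference gradient (sketch only)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        grad[i] = (recon_loss(z + d) - recon_loss(z - d)) / (2 * eps)
    z -= lr * grad

x_rec = softmax(z)
print(np.argmax(x_rec))  # recovered category index: 1, matching x_true
```

Rounding σ(z) to the nearest one-hot vector yields the final discrete reconstruction; the point of the relaxation is that standard gradient-based optimizers can now handle categorical features without any discrete search.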

Main contributions Our main contributions are:

• Novel insights enabling efficient attacks on FL with tabular data: using softmax to make the optimization problem fully continuous, ensembling to reduce the variance, and entropy to assess the reconstructions.
• An implementation of our approach in an end-to-end tool called TabLeak.
• Extensive experimental evaluation demonstrating the effectiveness of TabLeak at reconstructing sensitive client data on several popular tabular datasets.



¹ https://petsprizechallenges.com/



Figure 1: Overview of TabLeak. Our approach transforms the optimization problem into a fully continuous one by optimizing continuous versions of the discrete features, obtained by applying softmax (Step 1, middle boxes), resulting in N candidate solutions (Step 1, bottom). Then, we pool together an ensemble of N different solutions z_1, z_2, ..., z_N obtained from the optimization to reduce the variance of the reconstruction (Step 2). Finally, we assess the quality of the reconstruction by computing the entropy of the feature distributions in the ensemble (Step 3).
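Steps 2 and 3 of the figure can be sketched as follows for a single categorical feature: pool an ensemble of N candidate reconstructions (here by majority vote over their argmax, one plausible pooling choice) and score confidence via the entropy of the ensemble's empirical vote distribution. The function and variable names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def pool_and_score(candidates):
    """Pool N candidate reconstructions of one categorical feature.

    candidates: (N, D) array of per-candidate softmax outputs over D classes.
    Returns the majority-vote class and the entropy of the vote distribution;
    low entropy means high ensemble agreement, i.e. high confidence.
    """
    votes = np.argmax(candidates, axis=1)             # discretize candidates
    counts = np.bincount(votes, minlength=candidates.shape[1])
    p = counts / counts.sum()                         # empirical distribution
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))    # 0 if unanimous
    return int(np.argmax(counts)), float(entropy)

# High-agreement feature (cf. 'sex' in Figure 1): all candidates vote class 0.
agree = np.array([[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.7, 0.3]])
# Low-agreement feature (cf. 'race'): votes are split between the classes.
split = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])

label_a, ent_a = pool_and_score(agree)
label_s, ent_s = pool_and_score(split)
print(label_a, ent_a)  # 0 0.0  (unanimous ensemble, maximal confidence)
print(ent_a < ent_s)   # True: agreement => lower entropy => higher confidence
```

This captures the key observation from the text: when the ensemble members agree (low entropy), the pooled reconstruction is assigned high confidence, and disagreement (high entropy) flags features, like race in the figure, whose reconstruction should not be trusted.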

