INDIVIDUAL FAIRNESS OF DATA PROVIDER REGARDING PRIVACY RISK AND GAIN

Anonymous authors
Paper under double-blind review

Abstract

Fairness and privacy risks are important concerns when deploying machine learning (ML) in the real world. Recent studies have focused on group fairness and privacy protection, but no study has addressed individual fairness (IF) together with privacy protection. In this paper, we propose a new definition of IF from the perspective of privacy protection and experimentally evaluate privacy-preserving ML based on the proposed IF. For the proposed definition, we assume that users provide their data to an ML service and consider the principle that all users should obtain gains corresponding to their privacy risks. As a user's gain, we calculate the accuracy improvement on the user's data when the data are provided to the ML service. We conducted experiments on image and tabular datasets using three neural networks (NNs) and two tree-based algorithms with differential privacy guarantees. The experimental results of NNs show that we cannot stably improve the proposed IF by changing the strength of privacy protection or applying defenses against membership inference attacks. The results of tree-based algorithms show that privacy risks were extremely small regardless of the strength of privacy protection but raise a new question about users' motivation for providing their data.

1. INTRODUCTION

As machine learning (ML) services trained on users' data become increasingly popular, the privacy risks of memorizing training data have been gaining attention (Shokri et al., 2017; Jagielski et al., 2020; Nasr et al., 2021; Malek Esmaeili et al., 2021). To prevent privacy leakage through trained models, privacy-preserving ML based on differential privacy (DP) (Dwork et al., 2006) is the de facto standard. For example, DP-SGD (Song et al., 2013; Abadi et al., 2016) is used for training neural networks (NNs) based on stochastic gradient descent (SGD) with a DP guarantee, and DPBoost (Li et al., 2020) and DPXGBoost (Grislain & Gonzalvez, 2021) are used for training tree-based models with a DP guarantee.

When applying ML to the real world, fairness is another important concern. Recent studies have begun to focus on both privacy protection and fairness: the difference in the effect of DP on majority and minority groups (Bagdasaryan et al., 2019; Pujol et al., 2020; Farrand et al., 2020; Tran et al., 2021), the difference in vulnerability to membership inference attacks (MIAs) between majority and minority groups (Zhang et al., 2020; Zhong et al., 2022), and methods for guaranteeing both group fairness and DP (Xu et al., 2019; 2020). All of these studies have focused on group fairness, i.e., fairness between majority and minority groups. In situations where users decide whether to provide their data to ML services, individual fairness (IF), i.e., fairness between individual users, is also important for that decision. However, no study has focused on IF together with privacy protection.

In this paper, we investigate privacy-preserving ML from the perspective of both IF and privacy protection. To this end, we propose a new definition of IF from the perspective of privacy protection and experimentally evaluate privacy-preserving ML based on the proposed IF.
Assuming that users provide their data to an ML service, we define the proposed IF based on the principle that all users should obtain gains corresponding to their privacy risks. Furthermore, we discuss the relationship between the proposed IF and prior IF definitions for classification and validate the proposed IF using synthetic data. We extensively evaluate privacy-preserving ML in terms of the proposed IF. Using two image datasets, we evaluate a six-layer convolutional NN (CNN) and ResNet18 (He et al., 2016) trained with DP-SGD (Song et al., 2013; Abadi et al., 2016). Using two tabular datasets, we evaluate a five-layer fully connected NN trained with DP-SGD, DPBoost (Li et al., 2020), and DPXGBoost (Grislain & Gonzalvez, 2021). In the evaluation, as a user's privacy risk, we calculate a lower bound of the DP parameter ϵ (Jagielski et al., 2020; Malek Esmaeili et al., 2021). As a user's gain, we calculate the accuracy improvement on the user's data when the data are provided to the ML service. Since the accuracy improvement means that the utility of the ML service increases for the user, we can regard the accuracy improvement, i.e., the utility increase, as the user's gain. The results differed between NNs and tree-based algorithms. The main findings are as follows.

• The results of NNs show that unfairness in terms of the proposed IF was large regardless of the strength of privacy protection because some users' gains were small compared with their privacy risks. These results show that we cannot improve the proposed IF by adjusting the strength of privacy protection.

• We further evaluated the proposed IF when applying defenses against MIAs to NNs. No defense improved fairness consistently across settings (i.e., datasets, NNs, and strength of privacy protection), and the defenses degraded fairness in some settings. These results show the need for a method that stably improves the proposed IF of NNs regardless of the settings.
• The results of tree-based algorithms show that privacy risks and gains were extremely small regardless of the strength of privacy protection. For tree-based algorithms, the proposed IF therefore does not seem to be critical, but these results raise a new question about users' motivation for providing their data. For example, some users may be unwilling to provide their data if their data do not improve the ML service.
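The gain described above (the accuracy improvement on a user's own data when that user's data are included in training) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy nearest-centroid classifier and all function names (`user_gain`, `nearest_centroid`) are illustrative assumptions standing in for the actual models and training pipeline.

```python
import numpy as np

def accuracy(model, X, y):
    """Fraction of samples in (X, y) that the model classifies correctly."""
    return float(np.mean(model(X) == y))

def nearest_centroid(X_train, y_train):
    """Toy classifier: predict the class of the nearest class centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    def predict(X):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return classes[np.argmin(dists, axis=1)]
    return predict

def user_gain(X_rest, y_rest, X_user, y_user):
    """Gain of one user: accuracy on the user's own data with their
    data included in training minus accuracy with it excluded."""
    model_without = nearest_centroid(X_rest, y_rest)
    model_with = nearest_centroid(
        np.vstack([X_rest, X_user]), np.concatenate([y_rest, y_user]))
    return (accuracy(model_with, X_user, y_user)
            - accuracy(model_without, X_user, y_user))
```

A positive `user_gain` means the service became more useful for that user by contributing data; the proposed IF compares this quantity against the user's estimated privacy risk.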

2. PRELIMINARIES

Individual fairness. IF is a main concept of algorithmic fairness along with group fairness. IF was proposed for the classification task based on the principle that "similar data should be classified similarly" (Dwork et al., 2012). Let the input space be V, the set of output classes be A, the set of probability distributions over output classes be ∆(A), a mapping from an input to an output, i.e., an ML model, be M : V → ∆(A), and distance metrics on the input and output spaces be d : V × V → R and D : ∆(A) × ∆(A) → R. If a model is a Lipschitz mapping, the model satisfies the principle of IF.

Definition 1 (Lipschitz mapping). A mapping M : V → ∆(A) satisfies the (D, d)-Lipschitz property if for any x, y ∈ V, the following holds: D(M(x), M(y)) ≤ d(x, y).

The metrics d and D need to be designed for each task. An example of d is a Mahalanobis distance that excludes features correlated with sensitive attributes such as race and gender. Another definition based on the same principle was proposed by relaxing the Lipschitz property.

Definition 2 (ϵ-δ-IF (John et al., 2020)). A mapping M is ϵ-δ-individually fair if for all x, y such that d(x, y) ≤ ϵ, the following holds: |M(x) − M(y)| ≤ δ.

Note that in practice, a task-specific loss needs to be considered in addition to these definitions of IF for building a fair and accurate model.
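Definition 2 can be checked empirically on a finite sample of inputs by testing every pair. The sketch below is an assumption-laden illustration, not part of the paper: `is_eps_delta_if` and the default Euclidean input distance are hypothetical choices, and a pairwise scan over a sample is only a spot check, not a certificate over all of V.

```python
import numpy as np

def is_eps_delta_if(model, X, eps, delta, d=None):
    """Spot-check epsilon-delta individual fairness (Definition 2) on a
    finite sample X: every pair of inputs within distance eps must have
    scalar outputs within delta of each other."""
    if d is None:
        d = lambda a, b: float(np.linalg.norm(a - b))  # Euclidean input distance
    outputs = [float(model(x)) for x in X]
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if d(X[i], X[j]) <= eps and abs(outputs[i] - outputs[j]) > delta:
                return False  # witness pair violating the property
    return True
```

For a probabilistic classifier, `model(x)` here would be, e.g., the predicted probability of the positive class, with `delta` bounding how much that probability may change between similar inputs.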

Differential privacy. DP (Dwork et al., 2006) is a standard definition of privacy protection for statistical data analysis. In DP, we consider neighboring datasets D0 and D1 differing in only one sample. An example is adding one sample (x′, y′) to D0 to make D1, i.e., D1 = D0 ∪ {(x′, y′)}.

Definition 3 (Differential Privacy). A randomized mechanism M : D → R is (ϵ, δ)-differentially private if for any neighboring datasets D0, D1 and for any output range S ⊂ R, the following holds: Pr[M(D0) ∈ S] ≤ e^ϵ Pr[M(D1) ∈ S] + δ.
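A classic mechanism satisfying Definition 3 with δ = 0 is the Laplace mechanism. As a hedged illustration (not from the paper, which uses DP-SGD and DP tree ensembles): for a counting query, adding or removing one record changes the true count by at most 1 (sensitivity 1), so noise drawn from Lap(1/ϵ) yields (ϵ, 0)-DP. The function name `laplace_count` is our own.

```python
import numpy as np

def laplace_count(dataset, predicate, eps, rng=None):
    """Release a counting query under (eps, 0)-DP via the Laplace
    mechanism: true count plus Laplace noise with scale 1/eps,
    since a counting query has sensitivity 1."""
    rng = rng if rng is not None else np.random.default_rng()
    true_count = sum(1 for record in dataset if predicate(record))
    return true_count + rng.laplace(scale=1.0 / eps)
```

Smaller ϵ means stronger protection and larger noise; the paper's privacy-risk measure is a lower bound on this ϵ estimated per user.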

