INSTANCE-WISE BATCH LABEL RESTORATION VIA GRADIENTS IN FEDERATED LEARNING

Abstract

Gradient inversion attacks have posed a serious threat to the privacy of federated learning. These attacks search for the pair of input and label that best matches the shared gradients, and the search space can be reduced by restoring labels in advance. Recently, label restoration techniques have made it possible to extract labels from gradients analytically, but even the state of the art remains limited to identifying the presence of categories (i.e., class-wise label restoration). This work considers the more realistic setting in which a training batch contains multiple instances of each class. We propose an analytic method that performs instance-wise batch label restoration from only the gradient of the final layer. On the basis of approximately recovered class-wise embeddings and post-softmax probabilities, we establish linear equations relating the gradients, probabilities, and labels to derive the Number of Instances (NoI) per class by the Moore-Penrose pseudoinverse algorithm. Untrained models are most vulnerable to the proposed attack and therefore serve as the primary experimental setup. Our experimental evaluations reach over 99% Label existence Accuracy (LeAcc) and exceed 96% Label number Accuracy (LnAcc) in most cases on three image datasets and four untrained classification models; the two metrics measure class-wise and instance-wise label restoration accuracy, respectively. Recovery remains feasible even with a batch size of 4096 and with partially negative activations (e.g., Leaky ReLU and Swish). Furthermore, we demonstrate that our method facilitates existing gradient inversion attacks by exploiting the recovered labels, with an increase of 6-7 in PSNR on both MNIST and CIFAR100. Our code is available at https://github.com/BUAA-CST/iLRG.

1. INTRODUCTION

Federated Learning (FL) is one of the most popular distributed learning paradigms for privacy preservation and has attracted widespread attention (Jochems et al., 2016; McMahan et al., 2017; Yang et al., 2019a), especially in privacy-sensitive fields such as healthcare (Brisimi et al., 2018; Sadilek et al., 2021) and finance (Yang et al., 2019b; Long et al., 2020). FL requires participants to communicate gradients or weight updates instead of private data to a central server, in principle offering sufficient privacy protection. Contrary to prior belief, recent works have demonstrated that shared gradients can still leak sensitive information in FL. Multiple attack strategies have preliminarily looked into this issue despite their own limitations. For instance, Membership Inference (Shokri et al., 2017) allows the adversary to determine whether a given data sample is involved in the training set. Property Inference (Melis et al., 2019) analogously retrieves certain attributes (e.g., people's race and gender in the training set). Model Inversion (Fredrikson et al., 2015) utilizes a GAN (Goodfellow et al., 2014) to generate visual alternatives that look similar to, but are not, the original data. An emerging line of research, Deep Leakage from Gradients (DLG) (Zhu et al., 2019), has shown the possibility of fully recovering input data from gradients in a process now known as gradient inversion. This approach primarily relies on a gradient matching objective, i.e., optimizing a dummy input by minimizing the distance between its gradient and a target gradient sent from a client. It enables pixel-wise accurate reconstruction on image classification tasks, and soon scaled to deeper networks and larger-resolution images in a mini-batch (Geiping et al., 2020; Yin et al., 2021; Jeon et al., 2021; Li et al., 2022).
Gradient inversion jointly recovers private inputs and labels, whereas most works focus on the restoration of training samples. Label restoration is non-trivial, and it has been shown to be critical to high-quality data reconstruction; the optimization-based method for label recovery is not guaranteed to succeed either. Analytic extraction of the ground-truth label from the gradients produced by a single sample (Zhao et al., 2020) was first made possible through an innovative observation of the gradient sign. Along this line, follow-ups (Yin et al., 2021; Dang et al., 2021) extend the method to the recovery of labels in a mini-batch with a high success rate. Despite remarkable progress, such attacks are applicable only when no two inputs in the batch belong to the same class. In real-world scenarios there are multiple instances of each category, which means existing attacks can, at best, only determine which classes of samples are present. Moreover, most prior methods assume that the target model employs a non-negative activation function, and they fail to handle activation functions that may produce negative elements (e.g., Leaky ReLU (Maas et al., 2013) and Swish (Ramachandran et al., 2017) in EfficientNet (Tan & Le, 2019)). To this end, our work aims to identify the ground-truth label for each data point in one batch (i.e., instance-wise labels). Firstly, we recover class-wise averaged inputs (i.e., embeddings) to the final layer from its gradients with relative fidelity. The same approach can be proved perfectly correct in the single-sample case (Geiping et al., 2020), but here we require two empirical properties (Sun et al., 2021) to make three approximations: intra-class uniformity and concentration of the embedding distribution, and inter-class low entanglement of gradient contributions. We then derive that the gradient w.r.t. the network final logits equals the difference between the post-softmax probability and the binary representation of the label, and expand it into a system of equations consisting of the gradients, probabilities, and labels, where all three variables can be substituted or decomposed equivalently. We finally obtain a Least Squares Solution (LSS) for the Number of Instances (NoI) per class by the Moore-Penrose pseudoinverse algorithm. Our main contributions are as follows:

• We propose an analytic method to recover the exact labels that a client possesses via batch-averaged gradients in FL. Following the approximate restoration of class-wise embeddings and post-softmax probabilities, a system of linear equations in the gradients, probabilities, and labels is established to derive the NoI of each class by the Moore-Penrose pseudoinverse algorithm.

• Our method outperforms prior attacks and poses a greater threat to untrained models. In this case, it works at large batch sizes of up to 4096, and handles classification models with partially negative activations as deep as ResNet-152 (He et al., 2016).

• We demonstrate that gradient inversion attacks can be combined with our batch label restoration to further improve their performance due to the reduced search space during optimization.

2.1. PROBLEM FORMULATION

Given a network with weights W and the batch-averaged gradient ∇W calculated from a batch of sample-label pairs, we aim to reveal the instance-wise labels y via gradients. For each pair (x, y), we denote the embedding vector fed into the final layer as e ∈ R^m, the network final logits as z ∈ R^C, and the post-softmax probability as p ∈ R^C with entries in (0, 1), where m is the embedding dimension, C is the number of classes, and y here is the one-hot binary representation of the same shape as z. In the following sections, W ∈ R^{C×m} and b ∈ R^C refer to the weight and bias of the final classification layer, respectively. Then we have z = We + b and p = SoftMax(z). The models considered in this work contain at least one fully-connected layer and have a softmax activation with cross-entropy loss for classification, such as fully-connected neural networks (FCNs) and convolutional neural networks from shallow to deep, e.g., LeNet-5 (LeCun et al., 1998), VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), etc.
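The notation above can be made concrete with a small sketch (all shapes and values here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Minimal sketch of the Sec. 2.1 notation: e is the final-layer input,
# W/b the classifier weights, z the logits, p the post-softmax probability.
rng = np.random.default_rng(0)
m, C = 8, 5                      # embedding dim and number of classes (assumed)
W = rng.normal(size=(C, m))
b = rng.normal(size=C)
e = rng.normal(size=m)

z = W @ e + b                    # final logits, shape (C,)
p = np.exp(z - z.max())          # numerically stable softmax
p /= p.sum()

assert z.shape == (C,) and np.isclose(p.sum(), 1.0)
```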

2.2. ANALYTIC LABEL RESTORATIONS

Prior analytic label restoration methods primarily rely on a key observation. For a single sample, the derivative of the cross-entropy loss L w.r.t. the network final logit z at index i is ∇z_i = p_i − y_i (see Appendix A for the detailed derivation). This obviously leads to a unique negative sign for ∇z_i at the ground-truth index c. However, we can only access ∇W_i instead of ∇z_i. Using the chain rule, we have ∇W_i = ∇z_i × ∇_{W_i} z_i = ∇z_i e^⊤. As the embedding e is independent of the class index i, the uniqueness of the sign of ∇z_i carries over to ∇W_i. Formally, we have ∇W_i · ∇W_j = ||e||² ∇z_i ∇z_j, with ∇z_c < 0 and ∇z_{i≠c} > 0, such that the label c can be identified by inspecting whether ∇W_i · ∇W_j ≤ 0 for all j ≠ i. Because of the common use of non-negative activation functions (which ensure ||e|| > 0), e.g., ReLU (Glorot et al., 2011) and Sigmoid, we can simply extract the ground-truth label as the class whose gradient row ∇W_i is negative. This is exactly what iDLG (Zhao et al., 2020) does. Subsequently, Yin et al. (2021) empirically observe that the magnitude of the negative gradient significantly exceeds that of the positive gradients when a non-negative nonlinear activation function is applied, which indicates that a negative sign can still stand out after the averaging operation. They therefore search for negative signs using the minimum values rather than the summation along the feature dimension to perform label restoration from batch-averaged gradients. However, without the assumption that ||e|| > 0, neither approach remains viable. Another batch label restoration method, Revealing Labels from Gradients (RLG) (Dang et al., 2021), offers a novel insight that does not require that assumption. ∇W^⊤ can be decomposed into PΣQ by singular value decomposition, where P ∈ R^{m×S} and Q ∈ R^{S×C} are orthogonal matrices, Σ ∈ R^{S×S} is a diagonal matrix with non-negative diagonal elements, and S = rank(∇W^⊤) < min{m, C}.
If we denote r = ∇z Q^⊤, then rQ = ∇z, which means r·q_c < 0 and r·q_{j≠c} > 0 (q_j is the j-th column of Q). The problem of label recovery is then transformed into finding a classifier to separate q_c from q_{j≠c} by linear programming. Moreover, ∇W can also be disassembled: ∇W = (1/K) Σ_{j=1}^{K} ∇z_j e_j^⊤ = ZE^⊤, where Z = [∇z_1, ..., ∇z_K] ∈ R^{C×K}, E = (1/K)[e_1, ..., e_K] ∈ R^{m×K}, and K is the batch size. Assuming Z and E are full-rank matrices, we have K = S < min{m, C}, so the approach requires K not to be large.
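As a toy illustration of the single-sample sign rule described above, the following sketch uses synthetic tensors (the shapes and `true_label` are assumptions, not the authors' code):

```python
import numpy as np

# Single-sample label extraction via the gradient sign (the iDLG-style rule),
# assuming a non-negative embedding (e.g., after ReLU) so that ||e|| > 0.
rng = np.random.default_rng(1)
m, C, true_label = 16, 10, 3

e = np.abs(rng.normal(size=m))        # non-negative embedding
W = rng.normal(size=(C, m)); b = rng.normal(size=C)
z = W @ e + b
p = np.exp(z - z.max()); p /= p.sum()
y = np.eye(C)[true_label]

grad_z = p - y                         # derivative of CE loss w.r.t. logits
grad_W = np.outer(grad_z, e)           # chain rule: row i is grad_z[i] * e

# grad_W[i] @ grad_W[j] = ||e||^2 * grad_z[i] * grad_z[j], which is negative
# only when exactly one of i, j is the ground-truth class, so the label is
# the row whose dot products with all other rows are non-positive.
dots = grad_W @ grad_W.T
recovered = next(i for i in range(C)
                 if all(dots[i, j] <= 0 for j in range(C) if j != i))
assert recovered == true_label
```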

2.3. SINGLE EMBEDDING RECONSTRUCTION

In deep neural network architectures, the fully-connected layer is more vulnerable to leakage from gradients because of its simple design. A recent work, InvertingGradients (IG) (Geiping et al., 2020), has brought theoretical insight to this task by showing that embedding reconstruction is provably feasible.

Theorem 1. For neural networks with a biased fully-connected layer (e.g., the final classification layer), presume the derivative of the loss L w.r.t. the layer's output z contains at least one nonzero element; then the input e to the fully-connected layer can be uniquely reconstructed by analytic computation.

Proof. Consider the mapping z = We + b of a biased fully-connected layer without a nonlinear activation. It is easy to observe that ∂z/∂b = 1 ∈ R^C. Since our assumption guarantees ∂L/∂z_i ≠ 0 for some index i, we have ∂L/∂b_i = (∂L/∂z_i) × (∂z_i/∂b_i) = (∂L/∂z_i) × 1 = ∂L/∂z_i and ∂L/∂W_i = (∂L/∂z_i) × (∂z_i/∂W_i) = (∂L/∂z_i) e^⊤ = (∂L/∂b_i) e^⊤ according to the chain rule. Therefore, e can be calculated exactly as e = (∂L/∂b_i)^{-1} (∂L/∂W_i)^⊤.

On the basis of Theorem 1, we can perfectly accomplish the analytic reconstruction of a single input to a fully-connected layer. Such a one-shot theoretical approach, however, cannot be extended directly to recover batch embeddings because of unignorable information loss from the averaging operation.
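Theorem 1 can be checked numerically in a few lines. This is a sketch with synthetic tensors; the gradients are formed analytically from ∇z = p − y rather than by a deep-learning framework:

```python
import numpy as np

# Numerical check of Theorem 1: e = (dL/db_i)^(-1) * (dL/dW_i)^T
# for any index i with dL/db_i != 0 (single-sample case).
rng = np.random.default_rng(2)
m, C, label = 12, 6, 4

e = rng.normal(size=m)                 # embedding may contain negative entries
W = rng.normal(size=(C, m)); b = rng.normal(size=C)
z = W @ e + b
p = np.exp(z - z.max()); p /= p.sum()

grad_z = p - np.eye(C)[label]          # dL/dz for softmax cross-entropy
grad_b = grad_z                        # since dz/db is the identity
grad_W = np.outer(grad_z, e)           # dL/dW_i = grad_z[i] * e^T

i = int(np.argmax(np.abs(grad_b)))     # any index with grad_b[i] != 0 works
e_rec = grad_W[i] / grad_b[i]
assert np.allclose(e_rec, e)           # exact single-sample reconstruction
```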

3. METHODOLOGY

In this section, we propose a method to restore instance-wise labels via batch-averaged gradients, which we refer to as instance-wise Label Restoration from Gradients (iLRG). Our method consists of three main steps, as shown in Fig. 1(b): (1) reconstruct the class-wise embeddings by calculating the quotient of the two gradients of the weight and bias in the final classification layer; (2) feed the embeddings into this layer to obtain the subsequent post-softmax probabilities; (3) solve a system of linear equations for the number of instances per class by the Moore-Penrose pseudoinverse algorithm.

3.1. CLASS-WISE EMBEDDING RECONSTRUCTION

Two crucial observations presented in Soteria (Sun et al., 2021) push the embedding reconstruction further: (1) intra-class uniformity and concentration of the embedding distribution; (2) inter-class low entanglement of gradient contributions, an empirical observation that i-class samples mainly contribute to the i-th gradient row. Leveraging them, we can attempt to reconstruct the average embeddings of each class, i.e., class-wise embeddings. We first divide a training batch B into subsets of C distinct classes, i.e., B = {B_1, ..., B_C}. Then, based on the observations mentioned above, we make two approximations. The first approximation has been formalized in the Appendix of Soteria; we further propose a theoretical account of the latter.

Approx 1 (Intra-class Uniformity and Concentration of Embedding Distribution). The distribution of embeddings e is uniform and concentrated over a certain class of samples B_i in a training batch, so that we can replace them with the arithmetic mean of this category, i.e., the geometric center. Consequently, with Theorem 1, the average gradient at index i over B_i can be represented as formula (1):

∂L/∂W_i|_{B_i} = (1/|B_i|) Σ_{j∈B_i} ∂L_j/∂W_i^j = (1/|B_i|) Σ_{j∈B_i} (∂L_j/∂b_i^j) e_j^⊤ ≈ ((1/|B_i|) Σ_{j∈B_i} ∂L_j/∂b_i^j)((1/|B_i|) Σ_{j∈B_i} e_j^⊤) = ∂L/∂b_i|_{B_i} ē_{B_i}^⊤,   (1)

where (·)|_{B_i} and (·)_i^j denote, respectively, the mean of a variable across B_i and the variable at index i for sample j. Untrained models are obviously more likely to satisfy this property. If we project the embeddings onto a 2D plane, their distribution over the entire batch is close to a uniform circle due to the rather poor classification ability. We find that any embedding yields a nearly uniform 1/n probability output at the beginning of training, where n is the number of classes.
When the model is well-trained, although the various categories can be well separated, the internal distribution of a certain category may not be sufficiently uniform and symmetric, and there will be some outliers. We already know that ∂L/∂b_i = ∂L/∂z_i = p_i − y_i, which means the gradient is negative only at the ground-truth class index c, and Σ_{i=1}^{C} ∂L/∂b_i = Σ_{i=1}^{C} (p_i − y_i) = 0. In other words, the absolute value of the negative gradient at class index c is equal to the sum of the absolute values of the other, positive gradients. Based on this derivation, another approximation is proposed.

Approx 2 (Inter-class Low Entanglement of Gradient Contributions). The batch-averaged gradient row at index i mainly comes from the i-class samples in a training batch. Specifically, we have:

∂L/∂b_i|_B = (1/|B|) Σ_{j=1}^{C} |B_j| ∂L/∂b_i|_{B_j} ≈ (|B_i|/|B|) ∂L/∂b_i|_{B_i},
∂L/∂W_i|_B = (1/|B|) Σ_{j=1}^{C} |B_j| ∂L/∂W_i|_{B_j} ≈ (|B_i|/|B|) ∂L/∂W_i|_{B_i}.   (2)

Since ∂L/∂W_i = (∂L/∂b_i) e^⊤, the latter part of formula (2) requires the variance of ||e|| over the whole batch to be smaller than the proportionality of the bias gradient. Owing to the commonly used input normalization and batch normalization operations, this requirement is not difficult to meet, especially for untrained models. According to our derivation that ∇b_i = ∇z_i = p_i − y_i, the bias gradient of an i-class sample at index i is p_i^i − 1, while that of a j-class sample is p_i^j, where the superscripts indicate the categories. Therefore, Approx 2 holds when |p_i^j| ≪ |p_i^i − 1| for any j ≠ i. Hence, the entanglement is related to the label distribution, i.e., it should not be extremely disparate. Furthermore, the magnitude of the gradients is significantly reduced and becomes more sensitive to errors as training progresses. On the basis of Approx 2, we can derive that (∂L/∂b_i|_{B_i})^{-1} × (∂L/∂W_i|_{B_i})^⊤ ≈ (∂L/∂b_i|_B)^{-1} × (∂L/∂W_i|_B)^⊤. Of course, it is necessary to ensure that ∂L/∂b_i|_B ≠ 0 and ∂L/∂b_i|_{B_i} ≠ 0 here.
The occurrence of zero is uncommon, but when it does occur, we replace it with a sufficiently small number ϵ. Combined with formula (1), we finally get ē_{B_i} ≈ (∂L/∂b_i|_B)^{-1} × (∂L/∂W_i|_B)^⊤ to reconstruct the class-wise embeddings.

Bypassing Approx 2 when untrained. In Approx 2, if the error terms for the weight and bias gradients are not ignored, we have:

∂L/∂b_i|_B = (1/|B|) (|B_i| ∂L/∂b_i|_{B_i} + Σ_{j≠i} |B_j| ∂L/∂b_i|_{B_j}),
∂L/∂W_i|_B = (1/|B|) (|B_i| ∂L/∂W_i|_{B_i} + Σ_{j≠i} |B_j| ∂L/∂W_i|_{B_j}).   (3)

For an untrained model, the average e over any B_i is almost equal. As a result, we have:

∂L/∂W_i|_B ≈ (1/|B|) (|B_i| ∂L/∂b_i|_{B_i} + Σ_{j≠i} |B_j| ∂L/∂b_i|_{B_j}) ē^⊤.   (4)

Since the restored embedding is the quotient of formula (4) and the first part of formula (3), the error terms hardly affect the proportional result in this case, which means we can bypass Approx 2. Taking these two approximations together, we assert that our attacks on untrained models outperform those on trained models. The improvement of our work over Soteria is that they only restore a linearly scaled embedding γ∇W_i, whereas ours is more detailed and precise; here γ is a scale influenced by the local training steps (γ = 1 in FedSGD and γ > 1 in FedAvg (McMahan et al., 2016)) and ∇W_i is a brief notation for ∂L/∂W_i|_B. The reason is that they do not apply Theorem 1 to make use of ∇b_i.
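The quotient-based reconstruction can be sanity-checked in the idealized untrained regime, where every embedding in the batch equals a common mean, so Approx 1 holds exactly and Approx 2 is bypassed. This is a sketch with synthetic tensors, not the paper's code:

```python
import numpy as np

# Class-wise embedding recovery e_Bi ≈ (∇b_i)^(-1) * (∇W_i)^T from the
# batch-averaged final-layer gradients, in the idealized untrained setting
# where all samples share one embedding mu (the bypass case in the text).
rng = np.random.default_rng(3)
m, C, K = 12, 3, 6
counts = np.array([2, 1, 3])            # instances per class, sums to K
labels = np.repeat(np.arange(C), counts)

mu = rng.normal(size=m)                 # shared embedding (Approx 1, exact)
W = rng.normal(size=(C, m)); b = rng.normal(size=C)
z = W @ mu + b
q = np.exp(z - z.max()); q /= q.sum()   # identical post-softmax for all samples

# Per-sample gradients dL/dz = q - y, averaged over the batch.
grad_b = np.mean([q - np.eye(C)[y] for y in labels], axis=0)
grad_W = np.outer(grad_b, mu)           # every sample shares embedding mu

e_rec = grad_W / grad_b[:, None]        # row-wise quotient, one embedding/class
assert np.allclose(e_rec, mu)           # each row recovers the class mean
```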

3.2. INSTANCE-WISE LABEL RESTORATION

If we extend ∇z_i = p_i − y_i from a single sample to the whole batch, we get Σ_k ∂L_k/∂z_i^k = Σ_k p_i^k − Σ_k y_i^k, where (·)_i^k denotes the variable at index i for sample k in a batch of size K. After rearranging, it becomes Σ_k p_i^k − Σ_k ∂L_k/∂z_i^k = Σ_k y_i^k. The right-hand side of this equation is exactly the total number of instances of class i in the batch. Let k_j denote the number of j-class instances; then we can decompose Σ_k p_i^k into Σ_j k_j p̄_i|_{B_j}. From the class-wise embeddings deduced previously, it is possible to recover p̄_{B_j}, again based on the intra-class uniformity and concentration of the embedding distribution.

Approx 3 (Average Probabilities from Average Embeddings). The average post-softmax probability over j-class samples produced by a classification model with softmax activation is approximately that produced by the j-class average embedding:

p̄_{B_j} = (1/|B_j|) Σ_{t∈B_j} SoftMax(We_t + b) ≈ SoftMax(W ē_{B_j} + b).   (5)

Additionally, we have Σ_k ∂L_k/∂z_i^k = K ∂L/∂z_i|_B = K ∂L/∂b_i|_B = K∇b_i, where K is the batch size. Therefore, we arrive at the equation Σ_j k_j p̄_i|_{B_j} − K∇b_i = k_i for i = 1, ..., C and j = 1, ..., C (6). This is equivalent to the system of equations shown in (7):

k_1 + ... + k_i + ... + k_C = K,
Σ_j k_j p̄_1|_{B_j} − K∇b_1 = k_1,
Σ_j k_j p̄_2|_{B_j} − K∇b_2 = k_2,
...
Σ_j k_j p̄_C|_{B_j} − K∇b_C = k_C.   (7)

Since the coefficient matrix of this system is not square, we adopt the Moore-Penrose pseudoinverse algorithm to obtain an LSS. The final result also requires filtering of outliers and rounding. If the existence of class-wise labels can be obtained in advance through prior works (Zhao et al., 2020; Yin et al., 2021; Dang et al., 2021), we can further simplify the above system, i.e., discard the unknowns k_j and the corresponding equations for any class j with no sample in the training batch.
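The count-recovery step can be sketched end to end under the same idealized untrained-model assumption (all samples share one embedding, so the class-wise average probabilities coincide). The construction of system (7) and the pseudoinverse solve follow the text; the tensors themselves are synthetic:

```python
import numpy as np

# Sketch of the iLRG count recovery: build the linear system (7) relating
# restored probabilities, bias gradients, and per-class instance counts,
# then solve it with the Moore-Penrose pseudoinverse.
rng = np.random.default_rng(4)
m, C, K = 10, 4, 24
true_counts = np.array([6, 2, 11, 5])
labels = np.repeat(np.arange(C), true_counts)

mu = rng.normal(size=m)                           # shared embedding (idealized)
W = rng.normal(size=(C, m)); b = rng.normal(size=C)
z = W @ mu + b
q = np.exp(z - z.max()); q /= q.sum()             # shared probability vector

grad_b = np.mean([q - np.eye(C)[y] for y in labels], axis=0)  # attacker observes this
p_bar = np.tile(q, (C, 1)).T                      # p_bar[i, j]: class-j mean prob at index i

# System: sum_j k_j = K  and  sum_j p_bar[i, j] k_j - k_i = K * grad_b[i].
A = np.vstack([np.ones(C), p_bar - np.eye(C)])    # (C+1) x C coefficient matrix
rhs = np.concatenate([[K], K * grad_b])
k = np.linalg.pinv(A) @ rhs                       # least-squares solution
counts_rec = np.rint(k).astype(int)               # final rounding step
assert np.array_equal(counts_rec, true_counts)
```

In the realistic setting, `p_bar` would come from feeding the recovered class-wise embeddings through the final layer (Approx 3) rather than being identical per class.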

4. EXPERIMENTS

Setups. We evaluate our method on the classification task with three classic image datasets with ascending numbers of classes and four models from shallow to deep: (1) a 3-layer FCN (Fully-Connected Network) and its variations on the MNIST dataset of 10-class grayscale images of size 28×28; (2) the 7-layer LeNet-5 on the CIFAR100 dataset of 100-class color images of size 32×32; (3) the 16-layer VGG-16 and its variations on the large-scale 1000-class ImageNet ILSVRC 2012 dataset (Deng et al., 2009). The corresponding error analysis is given in Appendix D. Approx 1 is generally satisfied for both untrained and well-trained models, so we speculate that the drop may be due to the larger error contributed by Approx 2.

4.2. COMPARISON WITH PRIOR WORKS

Attack Baselines. We compare our attack with prior analytic approaches: (1) Improved Deep Leakage from Gradients (iDLG) (Zhao et al., 2020): single-sample label inference that finds the gradient row whose dot product with the other rows yields a negative value; (2) GradInversion (GI) (Yin et al., 2021): label restoration for the case of a single instance per class, based on the order of the minimum element in each gradient row; (3) Revealing Labels from Gradients (RLG) (Dang et al., 2021): extraction of a class-wise label set based on Singular Value Decomposition (SVD) and Linear Programming (LP). Owing to the limitation that iDLG and GI share, we scale them to the real-world setting of multiple instances per class in a batch for comparison. For iDLG, we alter the selection from the gradient row with the smallest element after summation to all negative gradient rows. And we pick all negative gradients instead of the top-K minimums in GI, where K is the batch size.

Validity of Probability Reconstruction. The key to our approach is the estimation of the class-wise average probabilities. The results in the last column of Table 1 show that the recovery of the class-wise probabilities is quite precise and corroborates the label restoration accuracies. We even achieve a 100% CosSim on both the CIFAR100 and ImageNet datasets; the essential cause behind this is that almost every class consists of at most two or three samples in a randomly selected batch of size 24.

Results of Label Restoration. Table 1 compares the performance of our proposed iLRG with several prior attack methods. First and foremost, we deepen the ability of the attack from determining the presence of labels to counting the instances of each class, which the others are incapable of. Our method achieves over 99.0% across all models and datasets for the two evaluation metrics here.
In terms of LeAcc, GI performs best apart from our iLRG when a non-negative nonlinear activation function is utilized in the model. Nevertheless, once the activation in the network can produce negative elements, its performance and that of iDLG plummet. The essential cause of this has been expounded in Section 2.2.

Figure 4: [...] and CIFAR100 (ResNet-18, BS16) compared with IG (Geiping et al., 2020). We assign a specific label to each instance after label restoration at 100% accuracy. The 6 best visual images are selected for display and for calculating the metrics.

4.4. IMPROVED GRADIENT INVERSION ATTACK WITH OURS

Gradient inversion attacks perform joint optimization of model inputs and labels, so labels may shift during optimization, which commonly leads to poor recovery of the inputs. Naturally, the proposed method can be used to specify an optimization objective for each instance, so as to enhance the existing attacks. Since the adversary randomly initializes the dummy inputs of a batch as the optimization objectives in a gradient inversion attack, and the batch is originally unordered, we can assign any label to each instance according to the instance-wise label restoration results, and the batch will eventually produce exactly the same batch-averaged gradients. Therefore, for each category with at least one instance, we simply select a subset of the remaining randomly initialized dummy inputs, with size equal to the number of instances of this class, and assign them labels of this category. Fig. 4 illustrates the improvement both visually and numerically. We choose IG as the baseline because it does not require substantial prior constraints.
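The label-assignment step can be sketched as follows (the name `assign_labels` is illustrative, not from the released code):

```python
import numpy as np

# Expand recovered per-class counts into one fixed label per dummy input.
# Since the dummy batch is unordered, any assignment whose label multiset
# matches the recovered counts yields the same batch-averaged gradients.
def assign_labels(counts):
    """One label per dummy input, repeated according to the recovered counts."""
    return np.repeat(np.arange(len(counts)), counts)

counts_rec = np.array([2, 0, 3, 1])       # e.g., output of the iLRG solver
dummy_labels = assign_labels(counts_rec)  # -> [0, 0, 2, 2, 2, 3]
assert dummy_labels.tolist() == [0, 0, 2, 2, 2, 3]
# These labels are then held fixed while only the dummy inputs are optimized
# to match the client's shared gradients.
```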

5. CONCLUSIONS

This work proposes instance-wise Label Restoration from Gradients (iLRG), a method to reveal instance-wise labels via shared batch-averaged gradients in FL. We build and solve a system of linear equations over the labels by leveraging a crucial derivation about the gradient of the network final logits and an approximate reconstruction of class-wise averaged probabilities. Our method performs extremely well for untrained models. We conduct comprehensive experiments on three classic image datasets with ascending numbers of classes and four models from shallow to deep (e.g., FCN on MNIST, LeNet-5 on CIFAR100, VGG and ResNet on ImageNet, etc.) under various settings with batch sizes of up to 4096. The evaluations demonstrate the capability of iLRG with a high proportion of both Label existence Accuracy (LeAcc) and Label number Accuracy (LnAcc). It works even on models with an activation function that is not uniformly non-negative. Finally, we further facilitate the existing gradient inversion attacks by exploiting the recovered labels.

∂p_i/∂z_j = p_i − p_i² if i = j, and ∂p_i/∂z_j = −p_i p_j if i ≠ j, where i and j both range over {1, ..., C}. And the cross-entropy loss contributes a nonzero derivative ∂L/∂p_i only if i = c, and 0 otherwise. Therefore, using the chain rule, ∂L/∂z_i can be calculated as follows:

∂L/∂z_i = 0 + (∂L/∂p_c) × (∂p_c/∂z_i) = −(1/p_c) × (p_c − p_c²) = p_c − 1 if i = c, and ∂L/∂z_i = −(1/p_c) × (−p_c p_i) = p_i if i ≠ c.

Merging the two branches, we reach our conclusion that ∇z_i = ∂L/∂z_i = p_i − y_i.
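The closed-form gradient ∇z_i = p_i − y_i can be verified against finite differences (a sanity-check sketch, not part of the paper):

```python
import numpy as np

# Central-difference check of dL/dz_i = p_i - y_i for softmax cross-entropy.
def ce_loss(z, c):
    p = np.exp(z - z.max()); p /= p.sum()
    return -np.log(p[c])                 # L = -log(p_c)

rng = np.random.default_rng(5)
C, c = 6, 2                              # number of classes, ground-truth index
z = rng.normal(size=C)
p = np.exp(z - z.max()); p /= p.sum()
y = np.eye(C)[c]

eps = 1e-6
num_grad = np.array([(ce_loss(z + eps * np.eye(C)[i], c)
                      - ce_loss(z - eps * np.eye(C)[i], c)) / (2 * eps)
                     for i in range(C)])
assert np.allclose(num_grad, p - y, atol=1e-5)   # matches the analytic form
```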

B THE ARCHITECTURE OF FCN-3

See Table 2.

C PSEUDO-CODE OF OUR ALGORITHM

Algorithm 1 provides pseudo-code for the complete procedure of our method.

D APPROXIMATION ERROR ANALYSIS

For the experiment in Section 4.1, we conduct the error analysis shown in Table 3.

(Algorithm 1, remaining steps) 4: Feed e_i into the final layer to get the network outputs z_i and post-softmax probabilities p_i; 5: Add e_i into E; 6: end for; 7: Solve the system of linear equations in (7) to obtain N; 8: return E and N.

Finally, we note that the recovered embeddings for the untrained model are the best. This is because it satisfies both properties and can even bypass Approx 2. In addition, from the above results we can also assert that the error of Approx 2 may be larger than that of Approx 1. Since Approx 2 directly participates in the division calculation for restoring the embeddings, this may intuitively also be a reason.

F EFFECT OF LABEL DISTRIBUTION

The inter-class entanglement of gradient contributions is significantly influenced by the label distribution. In an extremely imbalanced case, the gradients of a minor category i become entangled with categories that contain significantly more instances, i.e., the dominance of B_i over the gradients at index i is weakened. To verify this statement, we execute attacks on CIFAR100 with both untrained and trained VGG-16 models at batch sizes ranging from 24 to 648. The results in Table 5 are the average recovered NoI (Number of Instances) over 20 repeated experiments. We simply select three classes with class IDs 0, 18, and 92, whose NoIs are 1, BS−2, and 1, respectively (BS denotes the batch size). It can be seen that when the degree of imbalance exceeds a certain threshold for the trained model, the recovery for the minor classes 0 and 92 deteriorates significantly. However, our method restores the labels of the minor classes perfectly even under a data distribution as extreme as 626:1 when untrained. This is because, in this case, the errors arising from entanglement can be bypassed.

G EFFECT OF DEFENSE STRATEGIES

The key to defending against our attack is to avoid exchanging precise gradients. We discuss two defense schemes, Differential Privacy and Gradient Sparsification; Figure 5 shows their impact on the label restoration results.



Figure 1: Illustration of the threat model and the proposed method.

Figure 2: Performance comparison at different training stages.

Figure 3: The percentages of LeAcc and LnAcc as batch size and model depth increase under various settings. The FCN series models are distinguished by the number of hidden layers: a 2-layer model has no hidden layers, and the basic FCN-3 consists of one hidden layer (see Appendix B for the FCN-3 network architecture). BS is short for batch size.

A THE DERIVATION OF ∇z_i = ∂L/∂z_i = p_i − y_i

According to the definitions and notations given in Section 2.1, we have z = We + b and p = SoftMax(z) to model the mapping of the final layer and the post-softmax process. First, the softmax function defines the transformation p_i = SoftMax(z)_i = e^{z_i} / Σ_{j=1}^{C} e^{z_j}.

And the cross-entropy loss can be formalized as L = CE(p, y) = −Σ_{i=1}^{C} y_i log(p_i) = −log(p_c), where c represents the index of the ground-truth class.

Figure 5: The impact of two typical defense strategies Differential Privacy and Gradient Sparsification on the label restoration results.

Figure 6: Additional contrast examples of inverting ResNet-18 gradients on CIFAR100 to demonstrate our improvements.

at 224×224 pixels. (4) The 50-layer ResNet-50 and other variations on the ImageNet dataset. We use the training set by default in the following experiments, and the images are normalized during the data loading stage. All statistics except those in Section 4.1 are averaged over 50 repeated tests on a randomly selected batch. As mentioned previously, our attack is more effective on untrained models; therefore, unless otherwise specified, we focus on untrained models.

Evaluation metrics. To quantitatively analyze the performance of our label restoration attack, we propose the following two metrics: (1) Label existence Accuracy (LeAcc), the accuracy score for predicting label existence; (2) Label number Accuracy (LnAcc), the accuracy score for predicting the number of instances per class. Furthermore, in Section 4.4, we adopt the image reconstruction quality score Peak Signal-to-Noise Ratio (PSNR) and the perceptual similarity score Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) to measure the similarity between the restored images and the ground truth.
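Since the text does not spell out formulas for the two metrics, one plausible formalization (an assumption, with per-class averaging) is:

```python
import numpy as np

# Assumed formalization of the two metrics: LeAcc scores per-class existence
# predictions, LnAcc scores per-class instance counts. These definitions are
# illustrative; the paper does not give explicit formulas.
def le_acc(pred_counts, true_counts):
    return np.mean((pred_counts > 0) == (true_counts > 0))

def ln_acc(pred_counts, true_counts):
    return np.mean(pred_counts == true_counts)

true = np.array([3, 0, 1, 2, 0])
pred = np.array([3, 0, 2, 2, 1])
assert le_acc(pred, true) == 0.8          # existence wrong only for class 4
assert ln_acc(pred, true) == 0.6          # counts wrong for classes 2 and 4
```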

Table 1: Quantitative comparison of our label restoration attack with prior works in diverse scenarios. The batch size remains 24. *: the activation function in LeNet-5 is replaced from ReLU to Swish (Ramachandran et al., 2017), which is applied in EfficientNet (Tan & Le, 2019); -: this metric cannot be calculated. CosSim: the cosine similarity of our recovered post-softmax probabilities.

Table 2: The shape details of each FCN-3 layer.

The error analysis is shown in Table 3 below. We choose the models at epochs 0, 40, and 100, respectively, as representatives of the three stages: untrained, mid-training, and well-trained. The three approximations and the restored embeddings are our check items, and the evaluation metrics include MSE (Mean Squared Error), MRE (Mean Relative Error), and CosSim (Cosine Similarity), where MRE is the ratio of the error to the ground truth.

Algorithm 1: Class-Wise Embedding Inference and Instance-Wise Batch Label Restoration.
Input: Gradients of the weight and bias in the final layer, ∇W ∈ R^{C×m} and ∇b ∈ R^C.
Output: Class-wise averaged embeddings E = {e_i, i = 1, 2, ..., C} and the number of occurrences of each class N = {n_i, i = 1, 2, ..., C} in a training batch.
1: Initialize E = ∅ and N = ∅.
2: for i = 1 to C do

Table 3: Comparison of the errors at different training stages.

The MSE results are consistent with this property: best at the beginning, second best at the end, and worst in the middle. However, since the gradient magnitude becomes smaller as training progresses, the mid-term MRE may be less than the final one (for Approx 1 here). Secondly, Approx 2 represents the inter-class low entanglement of gradient contributions. Due to the complexity of the weight gradient errors, we choose to analyze the bias gradient. As training progresses, MSE decreases but MRE increases, indicating that the entanglement is actually increasing.

Published as a conference paper at ICLR 2023

Table 4: Comparison of the embedding reconstruction quality.

As shown in Table 4, our method outperforms Soteria in terms of the cosine similarity metric. This suffices to demonstrate that our restorations are of high quality. Moreover, we note that our restored embeddings are more precise than Soteria's when untrained, but this advantage does not hold up as well after training. In fact, the MSE of the proposed method is almost always less than that of Soteria; only in rare cases does a single large MSE lead to a large average MSE.

Table 5: The attack effect under the extreme distribution.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (32071775, U21B2021). Finally, Kailang Ma would like to extend great gratitude to his younger colleague Gaojian Xiong for his contribution to the theoretical aspects of this paper.

