DOES FEDERATED LEARNING REALLY NEED BACKPROPAGATION?

Abstract

Federated learning (FL) provides general principles for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. In experiments, we use BAFFLE to train models from scratch or to finetune pretrained models, achieving empirically acceptable results.

1. INTRODUCTION

Federated learning (FL) allows decentralized clients to collaboratively train a server model (Konečnỳ et al., 2016; McMahan et al., 2017). In each training round, the selected clients compute model gradients or updates on their local private datasets, without explicitly exchanging sample points with the server. While FL describes a promising blueprint and has several applications (Yang et al., 2018; Hard et al., 2018; Li et al., 2020b), its mainstream training paradigm is still gradient-based and requires the clients to locally execute backpropagation, which leads to two practical limitations: (i) Overhead for edge devices. The clients in FL are usually edge devices, such as mobile phones and IoT sensors, whose hardware is primarily optimized for inference-only purposes (Sharma et al., 2018; Umuroglu et al., 2018), rather than for backpropagation. Due to limited resources, computationally affordable models running on edge devices are typically quantized and pruned (Wang et al., 2019a), making exact backpropagation difficult. In addition, standard implementations of backpropagation rely on forward-mode or reverse-mode auto-differentiation in contemporary machine learning packages (Bradbury et al., 2018; Paszke et al., 2019b), which increases storage requirements. (ii) White-box vulnerability. To facilitate gradient computing, the server regularly distributes its model status to the clients, but this white-box exposure renders the server vulnerable to, e.g., poisoning or inversion attacks from malicious clients (Shokri et al., 2017; Xie et al., 2020; Zhang et al., 2020; Geiping et al., 2020). Consequently, recent attempts have been made to exploit trusted execution environments (TEEs) in FL, which can isolate the model status within a black-box secure area and significantly reduce the success rate of malicious evasion (Chen et al., 2020; Mo et al., 2021; Zhang et al., 2021; Mondal et al., 2021).
However, TEEs are highly memory-constrained (Truong et al., 2021), while backpropagation is memory-consuming because it must store intermediate states. While numerous solutions have been proposed to alleviate these limitations (discussed in Appendix B), in this paper we raise an essential question: does FL really need backpropagation? Inspired by the literature on zero-order optimization (Stein, 1981), we substitute backpropagation with multiple forward (i.e., inference) processes that estimate the gradients. Technically, we propose the framework of BAckpropagation-Free Federated LEarning (BAFFLE). As illustrated in Figure 1, BAFFLE consists of three conceptual steps: (1) each client locally perturbs the model parameters 2K times as W ± δ_k (the server sends a random seed to the clients for generating {δ_k}_{k=1}^K); (2) each client executes forward processes on the perturbed models using its private dataset D_c and obtains K loss differences {∆L(W, δ_k; D_c)}_{k=1}^K; (3) the server aggregates the loss differences to estimate the gradients. BAFFLE's defining characteristic is that it only uses forward propagation, which is memory-efficient and does not require auto-differentiation. It is well-adapted to model quantization and pruning as well as inference-only hardware optimization on edge devices. Compared to backpropagation, the computation graph of forward propagation in BAFFLE may be more easily optimized, such as by slicing it into per-layer calculations (Kim et al., 2020). Since each loss difference ∆L(W, δ_k; D_c) is a scalar, BAFFLE can easily accommodate the uploading bandwidth of clients by adjusting the value of K, as opposed to using, e.g., gradient compression (Suresh et al., 2017). BAFFLE is also compatible with recent advances in inference approaches for TEEs (Tramer & Boneh, 2019; Truong et al., 2021), providing an efficient way to combine TEEs with FL and prevent white-box evasion.
Based on our convergence analyses, we adapt secure aggregation (Bonawitz et al., 2017a) to zero-order optimization and investigate ways to improve gradient estimation in BAFFLE. In our experiments, BAFFLE is used to train models from scratch on MNIST (LeCun et al., 1998) and CIFAR-10/100 (Krizhevsky & Hinton, 2009), and to finetune ImageNet-pretrained models for transfer to OfficeHome (Venkateswara et al., 2017). Compared to conventional FL, BAFFLE achieves suboptimal but acceptable performance. These results shed light on the potential of BAFFLE and the effectiveness of backpropagation-free methods in FL.

2. PRELIMINARIES

In this section, we introduce the basic concepts of federated learning (FL) (Kairouz et al., 2021) and the finite difference formulas that will serve as the foundation for our methods. 

2.2. FINITE DIFFERENCE

Gradient-based optimization techniques (either first-order or higher-order) are the most frequently used tools to train deep networks (Goodfellow et al., 2016). Nevertheless, recent progress demonstrates promising applications of zero-order optimization methods for training, particularly when exact derivatives cannot be obtained (Flaxman et al., 2004; Nesterov & Spokoiny, 2017; Liu et al., 2020a) or backward processes are computationally prohibitive (Pang et al., 2020; He et al., 2022). Zero-order approaches require only multiple forward processes, which may be executed in parallel. Along this line, finite difference stems from the definition of derivatives and can be generalized to higher-order and multivariate cases via Taylor expansion. For any differentiable loss function L(W; D) and a small perturbation δ ∈ R^n, finite difference employs the forward difference scheme

L(W + δ; D) − L(W; D) = δ^⊤ ∇_W L(W; D) + o(∥δ∥_2),  (2)

where δ^⊤ ∇_W L(W; D) is a scaled directional derivative along δ. Furthermore, we can use the central difference scheme to obtain higher-order residuals as

L(W + δ; D) − L(W − δ; D) = 2 δ^⊤ ∇_W L(W; D) + o(∥δ∥_2²).  (3)

Finite difference formulas are typically used to estimate quantities such as the gradient norm or the Hessian trace, where δ is sampled from random projection vectors (Pang et al., 2020).
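As a concrete illustration, the two schemes in Eq. (2) and Eq. (3) can be compared on a toy quadratic loss with a known gradient. The following is a minimal NumPy sketch; the loss function, dimensions, and perturbation scale are illustrative assumptions, not taken from the paper's experiments:

```python
import numpy as np

def directional_diff_forward(loss, w, delta):
    """Forward difference (Eq. 2): approximates delta^T grad, residual o(||delta||)."""
    return loss(w + delta) - loss(w)

def directional_diff_central(loss, w, delta):
    """Central difference (Eq. 3): approximates 2 * delta^T grad, residual o(||delta||^2)."""
    return loss(w + delta) - loss(w - delta)

# Toy smooth loss with known gradient: L(w) = ||w||^2 / 2, so grad L(w) = w.
loss = lambda w: 0.5 * np.dot(w, w)
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
delta = 1e-3 * rng.standard_normal(10)

true_dir = delta @ w                                 # exact delta^T grad
fd = directional_diff_forward(loss, w, delta)        # first-order residual
cd = directional_diff_central(loss, w, delta) / 2.0  # smaller residual
```

For this quadratic loss the central scheme is exact up to floating-point error, while the forward scheme carries a residual of ∥δ∥²/2, matching the orders stated above.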

3. BAFFLE: BACKPROPAGATION-FREE FEDERATED LEARNING

In this section, we introduce zero-order optimization techniques into FL and develop BAFFLE, a backpropagation-free federated learning framework that uses multiple forward processes in place of backpropagation. An initial attempt is to apply finite difference as the gradient estimator. To estimate the full gradients this way, we would need to perturb each parameter w ∈ W once to approximate the partial derivative ∂L(W; D)/∂w, causing the number of forward computations to grow with n (recall that W ∈ R^n) and making it difficult to scale to large models. In light of this, we resort to Stein's identity (Stein, 1981) to obtain an unbiased estimation of gradients from loss differences calculated on various perturbations. As depicted in Figure 1, BAFFLE clients need only download random seeds and global parameter updates, generate perturbations locally, execute multiple forward propagations, and upload loss differences back to the server. Furthermore, we also present convergence analyses of BAFFLE, which provide guidelines for model design and for accelerating training.

3.1. UNBIASED GRADIENT ESTIMATION WITH STEIN'S IDENTITY

Previous work on sign-based optimization (Moulay et al., 2019) demonstrates that deep networks can be effectively trained if the majority of gradients have proper signs. Thus, we propose performing forward propagation multiple times on perturbed parameters, in order to obtain a stochastic estimation of gradients without backpropagation. Specifically, assuming that the loss function L(W; D) is continuously differentiable w.r.t. W given any dataset D, which is true (almost everywhere) for deep networks using non-linear activation functions, we define a smoothed loss function L_σ(W; D) := E_{δ∼N(0,σ²I)} [L(W + δ; D)], where the perturbation δ follows a Gaussian distribution with zero mean and covariance σ²I. Given this, Stein (1981) proves Stein's identity (we recap the proof in Appendix A.1), formulated as

∇_W L_σ(W; D) = E_{δ∼N(0,σ²I)} [ (δ / 2σ²) ∆L(W, δ; D) ],  (5)

where ∆L(W, δ; D) := L(W + δ; D) − L(W − δ; D) is the loss difference. Note that computing a loss difference only requires the execution of two forward processes, L(W + δ; D) and L(W − δ; D), without backpropagation. It is straightforward to show that L_σ(W; D) is continuously differentiable for any σ ≥ 0 and that ∇_W L_σ(W; D) converges uniformly as σ → 0; hence, ∇_W L(W; D) = lim_{σ→0} ∇_W L_σ(W; D). Therefore, we can obtain a stochastic estimation of the gradients via Monte Carlo approximation by 1) selecting a small value of σ; 2) randomly sampling K perturbations {δ_k}_{k=1}^K from N(0, σ²I); and 3) applying Stein's identity in Eq. (5) to calculate

∇̂_W L(W; D) := (1/K) Σ_{k=1}^K (δ_k / 2σ²) ∆L(W, δ_k; D).  (6)

Algorithm 1 Backpropagation-free federated learning (BAFFLE)
1: Notations: Se denotes operations done on the server; Cl denotes operations done on clients; TEE denotes the TEE module; and ⇒ denotes a communication process.
3: Se: initializing the model parameters to W_0;
4: Se: (optionally) encoding the computing paradigm of ∆L(W, δ; D) into the TEE module;
5: for t = 0 to T − 1 do
6: Se ⇒ all Cl: downloading the model parameters W_t (or the model update ∆W_t) and the computing paradigm;
7: Se ⇒ all Cl: downloading the random seed s_t; # 4 Bytes
8: Se: sampling K perturbations {δ_k}_{k=1}^K from N(0, σ²I) using the random seed s_t;
9: all Cl: negotiating a group of zero-sum noises {ϵ_c}_{c=1}^C for secure aggregation;
10: for c = 1 to C do
11: Cl: sampling K perturbations {δ_k}_{k=1}^K from N(0, σ²I) using the random seed s_t;
12: Cl: computing TEE ∘ ∆L(W_t, δ_k; D_c) via forward propagation for each k;
13: Cl ⇒ Se: uploading the K outputs {TEE ∘ ∆L(W_t, δ_k; D_c) + (N/N_c) ϵ_c}_{k=1}^K; # 4×K Bytes
14: end for
15: Se: aggregating ∆L̂(W_t, δ_k) ← Σ_{c=1}^C (N_c/N) [TEE ∘ ∆L(W_t, δ_k; D_c) + (N/N_c) ϵ_c] for each k;
16: Se: computing ∇̂_{W_t} L(W_t) ← (1/K) Σ_{k=1}^K (δ_k / 2σ²) ∆L̂(W_t, δ_k);
17: Se: W_{t+1} ← W_t − η ∇̂_{W_t} L(W_t);
18: end for
19: Return: final model parameters W_T.
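The forward-only estimator of Eq. (6), restricted to a single dataset, can be sketched in a few lines of NumPy. The toy quadratic loss and the hyperparameter values below are illustrative assumptions rather than the paper's models; the shared seed mimics how the server and clients regenerate the same perturbations from s_t:

```python
import numpy as np

def baffle_gradient(loss, w, K=200, sigma=1e-4, seed=0):
    """Estimate grad L(w) from 2K forward passes via Eq. (6).

    The server and each client share `seed`, so the perturbations never
    need to be transmitted; only K scalar loss differences are uploaded."""
    rng = np.random.default_rng(seed)  # stands in for the shared seed s_t
    grad = np.zeros_like(w)
    for _ in range(K):
        delta = sigma * rng.standard_normal(w.shape)
        # Loss difference: two forward passes, no backpropagation.
        diff = loss(w + delta) - loss(w - delta)
        grad += delta / (2.0 * sigma**2) * diff
    return grad / K

# Toy loss with known gradient w - 1 (so the true gradient at w=0 is -1).
loss = lambda w: 0.5 * np.sum((w - 1.0) ** 2)
w = np.zeros(5)
g_hat = baffle_gradient(loss, w, K=2000)
```

With n = 5 and K = 2000, g_hat matches the true gradient up to zero-mean noise that shrinks as K grows, in line with the convergence analysis of Section 3.3.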

3.2. OPERATING FLOW OF BAFFLE

Based on the forward-only gradient estimator ∇̂_W L(W; D) derived in Eq. (6), we outline the basic operating flow of our BAFFLE system in Algorithm 1, which consists of the following steps:

Model initialization. (Lines 3∼4, done by server) The server initializes the model parameters to W_0 and optionally encodes the computing paradigm of loss differences ∆L(W, δ; D) into the TEE module (see Appendix C for more information on TEEs).

Downloading paradigms. (Lines 6∼7, server ⇒ all clients) In round t, the server distributes the most recent model parameters W_t (or the model update ∆W_t = W_t − W_{t−1}) and the computing paradigm to all C clients. In addition, the server sends a random seed s_t rather than the perturbations themselves, to reduce the communication burden.

Local computation. (Lines 11∼12, done by clients) Each client generates K perturbations {δ_k}_{k=1}^K locally from N(0, σ²I) using the random seed s_t, and executes the computing paradigm to obtain loss differences. K is chosen adaptively based on the clients' upload bandwidth and computation capability.

Uploading loss differences. (Line 13, all clients ⇒ server) Each client uploads K noisy outputs {∆L(W_t, δ_k; D_c) + (N/N_c) ϵ_c}_{k=1}^K to the server, where each output is a floating-point number and the noise ϵ_c is negotiated by all clients to be zero-sum. The K noisy outputs thus require 4×K bytes of upload.

Secure aggregation. (Lines 15∼16, done by server) To prevent the server from recovering the exact loss differences and causing privacy leakage (Geiping et al., 2020), we adopt the secure aggregation method (Bonawitz et al., 2017a) originally proposed for conventional FL and apply it to BAFFLE. Specifically, all clients negotiate a group of noises {ϵ_c}_{c=1}^C satisfying Σ_{c=1}^C ϵ_c = 0.
Then we can reorganize our gradient estimator as

∇̂_{W_t} L(W_t) = (1/K) Σ_{c=1}^C (N_c/N) Σ_{k=1}^K (δ_k / 2σ²) ∆L(W_t, δ_k; D_c) = (1/K) Σ_{k=1}^K (δ_k / 2σ²) ∆L̂(W_t, δ_k),  (7)

where ∆L̂(W_t, δ_k) = Σ_{c=1}^C (N_c/N) [∆L(W_t, δ_k; D_c) + (N/N_c) ϵ_c]. Since {ϵ_c}_{c=1}^C are zero-sum, we have ∆L̂(W_t, δ_k) = Σ_{c=1}^C (N_c/N) ∆L(W_t, δ_k; D_c), and Eq. (7) holds. Thus, the server can correctly aggregate ∆L̂(W_t, δ_k) and protect client privacy without recovering any individual ∆L(W_t, δ_k; D_c).

Remark. After calculating the gradient estimation ∇̂_{W_t} L(W_t), the server updates the parameters to W_{t+1} using techniques such as gradient descent with learning rate η. Similar to the discussion in McMahan et al. (2017), the BAFFLE form presented in Algorithm 1 corresponds to FedSGD, where Lines 11∼12 execute once per round t. We can also generalize BAFFLE to an analog of FedAvg, in which each client updates its local parameters for multiple steps using the gradient estimator ∇̂_{W_t} L(W_t; D_c) derived from ∆L(W_t, δ_k; D_c) via Eq. (6), and uploads its model update to the server.
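The cancellation of the zero-sum masks under the N_c/N aggregation weights can be checked numerically. The sketch below generates the noises centrally for illustration only; in a real deployment the clients would negotiate them among themselves (e.g., via pairwise masks), and the per-client loss differences here are random stand-ins for forward-pass outputs:

```python
import numpy as np

rng = np.random.default_rng(42)
C, K = 4, 3                        # toy numbers of clients and perturbations
n_c = np.array([10, 20, 30, 40])   # local dataset sizes N_c
N = n_c.sum()                      # total dataset size N

# Stand-ins for the true per-client loss differences ΔL(W_t, δ_k; D_c).
local_diffs = rng.standard_normal((C, K))

# Zero-sum noises: eps_1 + ... + eps_C = 0 for every perturbation index k.
eps = rng.standard_normal((C, K))
eps -= eps.mean(axis=0, keepdims=True)   # enforce the zero-sum constraint

# Each client uploads only its masked outputs: diff + (N / N_c) * eps_c.
uploads = local_diffs + (N / n_c)[:, None] * eps

# Server aggregates with weights N_c / N; the masks cancel exactly,
# recovering the weighted average without exposing any individual diff.
aggregated = ((n_c / N)[:, None] * uploads).sum(axis=0)
expected = ((n_c / N)[:, None] * local_diffs).sum(axis=0)
```

The key point is that the weight N_c/N multiplies the mask (N/N_c)ϵ_c back to ϵ_c, so the server's sum contains Σ_c ϵ_c = 0 and equals the unmasked aggregate.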

3.3. CONVERGENCE ANALYSES

Now we analyze the convergence rate of our gradient estimation method. For continuously differentiable loss functions, we have ∇_W L(W; D) = lim_{σ→0} ∇_W L_σ(W; D), so we choose a relatively small value for σ. The convergence guarantee can be derived as follows:

Theorem 1. (Proof in Appendix A.2) For perturbations {δ_k}_{k=1}^K iid ∼ N(0, σ²I), define the empirical covariance matrix Σ̂ := (1/(Kσ²)) Σ_{k=1}^K δ_k δ_k^⊤ and the empirical mean δ̄ := (1/K) Σ_{k=1}^K δ_k. Then for any W ∈ R^n, the relation between ∇̂_W L(W; D) and the true gradient ∇_W L(W; D) can be written as

∇̂_W L(W; D) = Σ̂ ∇_W L(W; D) + o(δ̄),  s.t.  E[Σ̂] = I, E[δ̄] = 0,  (8)

where σ is a small value such that the central difference scheme in Eq. (3) holds. Taking expectations on both sides of Eq. (8) yields E[∇̂_W L(W; D)] = ∇_W L(W; D). Note that in the finetuning setting, n represents the number of trainable parameters, excluding frozen parameters. In conclusion, ∇̂_W L(W; D) provides an unbiased estimation of the true gradients, with a convergence rate of O(√(n/K)). Empirically, ∇̂_W L(W; D) is used as a noisy gradient to train models, the generalization of which has been analyzed in previous work (Zhu et al., 2019; Li et al., 2020a).
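The statements E[Σ̂] = I and E[δ̄] = 0 admit a quick Monte Carlo sanity check; the dimension, sample count, and scale below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, sigma = 8, 50000, 1e-2
deltas = sigma * rng.standard_normal((K, n))   # K perturbations from N(0, sigma^2 I)

# Empirical covariance matrix and mean defined in Theorem 1.
Sigma_hat = deltas.T @ deltas / (K * sigma**2)
delta_bar = deltas.mean(axis=0)

# Sigma_hat concentrates around the identity and delta_bar around zero,
# with fluctuations on the order of 1/sqrt(K).
```

Shrinking K inflates the deviation of Σ̂ from I, which is exactly the noise that multiplies the true gradient in Eq. (8).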

4. EXPERIMENTS

We evaluate the performance of BAFFLE on four benchmark datasets: MNIST (LeCun et al., 1998), CIFAR-10/100 (Krizhevsky & Hinton, 2009), and OfficeHome (Venkateswara et al., 2017). We consider three models: 1) LeNet (LeCun et al., 1998) with two convolutional layers as the shallow model (2.7 × 10^4 parameters); 2) WideResNet (Zagoruyko & Komodakis, 2016) with depth 10 and width 2 (WRN-10-2) as the lightweight deep model (3.0 × 10^5 parameters); and 3) MobileNet (Howard et al., 2017) as the deep model (1.3 × 10^7 parameters) that works on ImageNet-scale inputs. To perform a comprehensive evaluation of BAFFLE, we simulate three popular FL scenarios (Caldas et al., 2018b) with the participation tools from FedLab (Zeng et al., 2021): iid participation, label non-iid participation, and feature non-iid participation. For iid participation, we set the client number C = 10 and use a uniform distribution to build local datasets; we then evaluate BAFFLE on MNIST and CIFAR-10/100 under both batch-level (FedSGD) and epoch-level (FedAvg) communication settings. For label non-iid participation, we set the client number C = 100 and use a Dirichlet distribution to build local datasets. For feature non-iid participation, we build clients from the prevailing domain adaptation dataset OfficeHome, which contains 65 categories from 4 different domains, i.e., Art, Clipart, Product, and Real-world; we set the total client number to C = 40 and generate 10 clients from each domain. We report Top-1 accuracy for MNIST, CIFAR-10, and OfficeHome, and Top-5 accuracy for CIFAR-100 and OfficeHome.

4.1. EXPERIMENTAL SETTINGS

Following the settings in Section 2.1, we use FedAvg to aggregate gradients from multiple clients and an SGD-based optimizer to update the global parameters. Specifically, we use Adam (Kingma & Ba, 2015) to train a randomly initialized model with β = (0.9, 0.99), learning rate 0.01, and 20/40 epochs for MNIST and CIFAR-10/100, respectively. For OfficeHome, we adopt the transfer learning strategy (Huh et al., 2016) by loading the ImageNet-pretrained model and finetuning the final layers with Adam, with learning rate 0.005 and 40 epochs. In BAFFLE, the perturbation scale σ and the number of perturbations K are the most important hyperparameters. As shown in Theorem 1, with less noise and more samples, BAFFLE obtains more accurate gradients, leading to improved performance. However, there exists a trade-off between accuracy and computational efficiency: an extremely small σ will cause underflow problems (Goodfellow et al., 2016), while a large K will increase the computational cost. In practice, we set σ = 10^−4, because it is the smallest value that does not cause numerical problems in any of our experiments and works well on edge devices with half-precision floating-point numbers. We also evaluate the impact of K across a broad range, from 100 to 5000. For the general family of continuously differentiable models, we analyze the convergence rate of BAFFLE in Section 3.3. Since deep networks are usually stacked with multiple linear layers and non-linear activations, this layer linearity can be utilized to improve the accuracy-efficiency trade-off.
Combining the linearity property and the unique conditions on edge devices (e.g., small data size and the half-precision format), we present four guidelines for model design and training that can increase accuracy without introducing extra computation (for the details of the linearity analysis, see Appendix D):

Using the twice forward difference (twice-FD) scheme rather than the central scheme. Comparing the difference schemes in Eq. (2) and Eq. (3), the central scheme executes twice as many forward inferences per perturbation (i.e., W ± δ) and achieves lower residuals, whereas twice-FD can spend the same budget on twice as many samples. With the same total number of forward passes (e.g., 2K), which scheme performs better is a practical question. As shown in Appendix D, we find that twice-FD performs better in all experiments, in part because the linearity reduces the benefit from second-order residuals.

Using Hardswish in BAFFLE. ReLU is effective when the middle features (h(•) denotes the feature mapping) have the same sign before and after perturbation, i.e., h(W + δ) • h(W) > 0. Since ReLU is not differentiable at zero, a value jump occurs when the sign of the features changes after perturbation, i.e., h(W + δ) • h(W) < 0. We use Hardswish (Howard et al., 2019) to overcome this problem, as it is continuously differentiable at zero and easy to implement on edge devices.

Using exponential moving average (EMA) to reduce oscillations. As shown in Theorem 1, there exists a zero-mean white-noise term between the true gradient and our estimation. To smooth out the oscillations caused by this white noise, we apply the EMA strategy from BYOL (Grill et al., 2020) to the global parameters, with a smoothing coefficient of 0.995.

Using GroupNorm as opposed to BatchNorm. On edge devices, the local dataset size is typically small, which leads to inaccurate batch statistics and degrades performance when using BatchNorm. We therefore employ GroupNorm (Wu & He, 2020) to solve this issue.
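The EMA guideline can be sketched in a few lines. The toy noisy sequence below is our own illustration of a zero-mean white-noise perturbation around a fixed target; only the smoothing coefficient 0.995 comes from the text:

```python
import numpy as np

def ema_update(ema_params, new_params, tau=0.995):
    """BYOL-style exponential moving average over global parameters.

    `tau` is the smoothing coefficient; larger tau means heavier smoothing."""
    return {k: tau * ema_params[k] + (1.0 - tau) * new_params[k]
            for k in ema_params}

# Toy usage: the EMA damps zero-mean white noise in the parameter iterates.
rng = np.random.default_rng(0)
ema = {"w": np.zeros(3)}
for _ in range(2000):
    noisy = {"w": np.ones(3) + 0.5 * rng.standard_normal(3)}  # noisy iterate
    ema = ema_update(ema, noisy)
# ema["w"] tracks the target (here, 1) with far smaller variance
# than any single noisy iterate.
```

Because the noise from the gradient estimator is zero-mean (Theorem 1), the averaged parameters concentrate around the underlying trajectory while individual iterates oscillate.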

4.2. PERFORMANCE ON IID CLIENTS

Following the settings in Section 4.1, we evaluate the performance of BAFFLE in the iid scenario. We reproduce all experiments on backpropagation-based FL systems with the same settings and use them as the ground truth, referring to these results as exact gradients, and report the training process of BAFFLE in Figure 2. The value of K (e.g., 200 for LeNet and 500 for WRN-10-2) is significantly smaller than the dimension of the parameter space (e.g., 2.7 × 10^4 for LeNet and 3 × 10^5 for WRN-10-2). Since the convergence rate to the exact gradient is O(√(n/K)), the marginal benefit of increasing K decreases. For instance, increasing K from 2000 to 5000 on CIFAR-10 with WRN-10-2 improves accuracy by only about 2%. Given that this rate holds for Gaussian perturbations, the sampling efficiency might be improved by choosing an alternative perturbation distribution.

Ablation studies. As depicted in Figure 3, we conduct ablation studies to evaluate the aforementioned guidelines. In general, twice-FD, Hardswish, and EMA all improve accuracy. For the two difference schemes, we compare twice-FD to the central scheme under the same computation cost and show that the former outperforms the latter, demonstrating that linearity reduces the gain from second-order residuals. As for activation functions, Hardswish is superior to ReLU and SELU because it is differentiable at zero and vanishes in the negative part. Moreover, EMA enhances performance by reducing the effect of white noise.

Communication efficiency. In each communication round of BAFFLE, every client uploads a K-dimensional vector to the server and downloads the updated global parameters. Since K is significantly smaller than the number of parameters (e.g., 500 versus 0.3 million), BAFFLE reduces data transfer by approximately half compared to the batch-level communication setting (FedSGD) of a backpropagation-based FL system.
To reduce communication costs, prevalent FL systems require each client to perform model optimization on its local training dataset and upload the model update to the server after a specified number of local epochs. BAFFLE can also communicate at the epoch level by employing O(n) additional memory to store the perturbation in each forward process and estimating the local gradient via Eq. (6). Each client optimizes its local model with SGD and uploads the local update after a number of epochs. As shown in Table 1, we evaluate the performance of BAFFLE under the one-epoch communication setting. As epoch-level communication is more prevalent in real-world FL, all the following experiments are conducted in this setting.
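A single client's epoch-level routine can be sketched as follows: several local SGD steps driven by the forward-only estimator of Eq. (6), after which only the accumulated model update is uploaded. The quadratic loss, step count, learning rate, and K are illustrative toy values, not the paper's configuration:

```python
import numpy as np

def local_update(loss, w, steps=10, lr=0.1, K=100, sigma=1e-4, seed=0):
    """Epoch-level BAFFLE on one client: run `steps` local SGD steps using
    the forward-only gradient estimator, then return the model update."""
    rng = np.random.default_rng(seed)
    w_local = w.copy()
    for _ in range(steps):
        grad = np.zeros_like(w_local)
        for _ in range(K):
            delta = sigma * rng.standard_normal(w_local.shape)
            # Two forward passes per perturbation; no backpropagation.
            grad += delta / (2 * sigma**2) * (loss(w_local + delta)
                                              - loss(w_local - delta))
        w_local -= lr * grad / K
    return w_local - w  # the update uploaded to the server

# Toy loss minimized at w = 2 in every coordinate.
loss = lambda w: 0.5 * np.sum((w - 2.0) ** 2)
update = local_update(loss, np.zeros(4))
```

Starting from w = 0, applying the returned update moves the parameters a substantial fraction of the way toward the minimizer, despite the gradient noise.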

4.3. PERFORMANCE ON NON-IID CLIENTS

Following the settings in Section 4.1, we evaluate the performance of BAFFLE in both label non-iid and feature non-iid scenarios. For label non-iid scenarios, we use the CIFAR-10/100 datasets and employ a Dirichlet distribution to ensure that each client has a unique label distribution. We evaluate the performance of BAFFLE with 100 clients and various K values. As seen in Table 2, the model suffers a significant drop in accuracy (e.g., 14% on CIFAR-10 and 16% on CIFAR-100) due to the label non-iid effect. For feature non-iid scenarios, we construct clients from the OfficeHome dataset and use MobileNet as the deep model. As seen in Table 3, we use the transfer learning strategy to train MobileNet, i.e., we load the parameters pretrained on ImageNet, freeze the backbone parameters, and retrain the classification layers. The accuracy decrease is approximately 3% ∼ 5%.

BAFFLE performs K forward passes in place of each backward pass. Since a backward pass is about as expensive as two normal forward passes (Hinton & Srivastava, 2010) and five single-precision accelerated forward passes (Nakandala et al., 2020), BAFFLE incurs approximately K/5 times the computation of BP-based FL. Although this amounts to K/5 − 1 times extra computation, the cost can be reduced with proper training strategies; e.g., the transfer learning in Table 3 reduces K to 20 for MobileNet on the 224 × 224 sized OfficeHome dataset. Moreover, BAFFLE can greatly reduce memory cost on edge devices, in terms of both static and dynamic memory. Running BP on deep networks requires an auto-differentiation framework, which consumes extra static memory (e.g., 200MB for Caffe (Jia et al., 2014) and 1GB for PyTorch (Paszke et al., 2019a)) and imposes a considerable burden on edge devices such as IoT sensors. Due to the necessity of storing intermediate states, BP also requires large amounts of dynamic memory (≥ 5GB for MobileNet (Gao et al., 2020)).
Since BAFFLE only requires inference, we can slice the computation graph and execute the forward calculations layer by layer (Kim et al., 2020). As shown in Table 4, BAFFLE reduces the memory cost to 5%∼10% by executing inference-only computations layer-by-layer. By applying kernel-wise computations, we can further reduce the memory cost to approximately 1% (e.g., 64MB for MobileNet (Truong et al., 2021)), which is suitable for scenarios with extremely limited storage resources, such as TEEs. Recent works exploit TEEs to prevent model exposure and thereby defend against white-box attacks (Kim et al., 2020). However, due to the security guarantee, the usable memory of a TEE is usually small (Truong et al., 2021) (e.g., 90MB on Intel SGX for Skylake CPUs (McKeen et al., 2016)), which is typically far less than what a backpropagation-based FL system requires. In contrast, BAFFLE can execute in a TEE due to its small memory footprint (more details are in Appendix C). Membership inference and model inversion attacks need to repeatedly perform model inference and obtain confidence values or classification scores (Shokri et al., 2017; Zhang et al., 2020). Given that BAFFLE returns stochastic loss differences ∆L(W, δ; D) associated with random perturbations δ, off-the-shelf inference attacks may not apply to BAFFLE directly (though adaptively designed attacking strategies could possibly evade it). We further select random samples from the validation dataset and generate random input pairs (X̃, ỹ). As shown in Figure 4, the resulting loss differences are difficult to distinguish between real data and random noise, indicating that it is hard for attackers to obtain useful information from BAFFLE's outputs.
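The layer-by-layer execution described above can be illustrated with a minimal sketch of a small ReLU network, where each layer's weights are materialized only when needed; the loader mechanism is a hypothetical stand-in for reading weights from flash or secure storage, not the paper's actual TEE implementation:

```python
import numpy as np

def forward_layerwise(layer_loaders, x):
    """Memory-frugal inference: materialize one layer at a time, so the peak
    footprint is a single layer's weights plus one activation buffer, rather
    than the whole network plus the intermediate states BP would retain."""
    h = x
    for load in layer_loaders:
        W, b = load()                    # fetch this layer's weights on demand
        h = np.maximum(W @ h + b, 0.0)   # forward through one ReLU layer
        # W and b go out of scope here and can be freed before the next layer.
    return h

# Toy check against a monolithic forward pass over the same two-layer network.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)
loaders = [lambda: (W1, b1), lambda: (W2, b2)]
y = forward_layerwise(loaders, x)
y_ref = np.maximum(W2 @ np.maximum(W1 @ x + b1, 0.0) + b2, 0.0)
```

The sliced execution produces the same output as the monolithic pass; the gain is purely in peak memory, which is what makes forward-only BAFFLE fit inside a memory-constrained TEE.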

5. CONCLUSION AND DISCUSSION

Backpropagation is the gold standard for training deep networks and is also utilized by traditional FL systems. However, backpropagation is unsuited to edge devices due to their limited resources and possible lack of trustworthiness. Using zero-order optimization techniques, we explore the possibility of backpropagation-free FL in this paper. We note that in scenarios where clients are fully trusted and have sufficient computing and storage resources, traditional FL with backpropagation remains preferable to BAFFLE. While our preliminary studies on BAFFLE have produced encouraging results, a number of challenging topics remain: (i) Compared to models trained with exact gradients, the accuracy of models trained with BAFFLE is inferior. One reason is that we select small values of K (e.g., 500) relative to the number of model parameters (e.g., 3.0 × 10^5); another is that gradient descent is designed for exact gradients, whereas our noisy gradient estimation may require more advanced learning algorithms. (ii) The empirical variance of zero-order gradient estimators affects training convergence in BAFFLE. It is crucial to research variance reduction approaches, such as control variates and non-Gaussian sampling distributions. (iii) Stein's identity is formulated for loss functions with Gaussian noise imposed on the model parameters. Intuitively, this smoothing is related to differential privacy in FL, but determining their exact relationship requires further theoretical derivation.

A PROOFS

A.1 PROOF OF STEIN'S IDENTITY

We recap the proof of Stein's identity following He et al. (2022):

∇_W L_σ(W; D) = ∇_W E_{δ∼N(0,σ²I)} [L(W + δ; D)]
= (2πσ²)^{−n/2} ∫ ∇_W L(W + δ; D) · exp(−∥δ∥_2² / 2σ²) dδ
= (2πσ²)^{−n/2} ∫ L(W̃; D) · ∇_W exp(−∥W̃ − W∥_2² / 2σ²) dW̃
= (2πσ²)^{−n/2} ∫ L(W̃; D) · ((W̃ − W) / σ²) · exp(−∥W̃ − W∥_2² / 2σ²) dW̃
= (2πσ²)^{−n/2} ∫ L(W + δ; D) · (δ / σ²) · exp(−∥δ∥_2² / 2σ²) dδ
= E_{δ∼N(0,σ²I)} [ (δ / σ²) L(W + δ; D) ],

where the substitution W̃ = W + δ is used. By symmetry, we change δ to −δ and obtain ∇_W L_σ(W; D) = −E_{δ∼N(0,σ²I)} [ (δ / σ²) L(W − δ; D) ], and further we prove that

∇_W L_σ(W; D) = (1/2) ( E_{δ∼N(0,σ²I)} [ (δ / σ²) L(W + δ; D) ] − E_{δ∼N(0,σ²I)} [ (δ / σ²) L(W − δ; D) ] ) = E_{δ∼N(0,σ²I)} [ (δ / 2σ²) ∆L(W, δ; D) ].

A.2 PROOF OF THEOREM 1

We rewrite ∇̂_W L(W; D) as follows:

∇̂_W L(W; D) = (1/K) Σ_{k=1}^K [ (δ_k / 2σ²) ∆L(W, δ_k; D) ]
= (1/K) Σ_{k=1}^K [ (δ_k / 2σ²) (2 δ_k^⊤ ∇_W L(W; D) + o(∥δ_k∥_2²)) ]  (using the central scheme in Eq. (3))
= (1/(Kσ²)) Σ_{k=1}^K [δ_k δ_k^⊤] ∇_W L(W; D) + (1/K) Σ_{k=1}^K (δ_k / 2σ²) o(∥δ_k∥_2²)
= Σ̂ ∇_W L(W; D) + (1/K) Σ_{k=1}^K (δ_k / 2σ²) o(∥δ_k∥_2²).  (11)

Then we prove (1/K) Σ_{k=1}^K (δ_k / 2σ²) o(∥δ_k∥_2²) = o(δ̄). Suppose δ_k = (δ_{k,1}, ⋯, δ_{k,n}); then ∥δ_k∥_2² / σ² = Σ_{i=1}^n (δ_{k,i} / σ)². Since δ_{k,i} / σ ∼ N(0, 1) for all i, we have ∥δ_k∥_2² / σ² ∼ χ²(n) and E[∥δ_k∥_2² / σ²] = n. So with high probability, o(∥δ_k∥_2²) / σ² = o(n). Substituting this into Eq. (11), we have with high probability

(1/K) Σ_{k=1}^K (δ_k / 2σ²) o(∥δ_k∥_2²) = δ̄ · o(n) = o(δ̄),

where we regard n as a constant for a given model architecture. Finally, we prove E[δ̄] = 0 and E[Σ̂] = I. It is trivial that E[δ̄] = 0, since δ̄ ∼ N(0, (σ²/K) I). For E[Σ̂] = I, we examine each entry

Σ̂_[ij] = (1/(Kσ²)) Σ_{k=1}^K δ_{k[i]} δ_{k[j]} = (1/K) Σ_{k=1}^K (δ_{k[i]} / σ)(δ_{k[j]} / σ),

where the subscripts [ij] and [i] denote the usual indexing of matrices and vectors. Specifically, for diagonal entries (i.e., i = j), we observe that K · Σ̂_[ii] = Σ_{k=1}^K (δ_{k[i]} / σ)² is distributed as χ²(K), which means E[Σ̂_[ii]] = 1 = I_[ii] and Var[Σ̂_[ii]] = 2/K; for non-diagonal entries (i.e., i ≠ j), we have

E[Σ̂_[ij]] = (1/K) Σ_{k=1}^K E[(δ_{k[i]} / σ)(δ_{k[j]} / σ)] = (1/K) Σ_{k=1}^K (E[δ_{k[i]}] / σ)(E[δ_{k[j]}] / σ) = 0 = I_[ij],

due to the independence between different dimensions of δ_k.

B RELATED WORK

Along the research routine of FL, many efforts have been devoted to, e.g., dealing with non-IID distributions (Zhao et al., 2018; Sattler et al., 2019; Eichner et al., 2019; Wang et al., 2020b; Li et al., 2020c), multi-task learning (Smith et al., 2017; Marfoq et al., 2021), and preserving the privacy of clients (Bonawitz et al., 2016; 2017b; McMahan et al., 2018; Truex et al., 2019; Hao et al., 2019; Lyu et al., 2020; Ghazi et al., 2020; Liu et al., 2020b). Below we introduce the work on efficiency and vulnerability in FL following the survey of Kairouz et al. (2021), as it is most related to this paper. Efficiency in FL. It is widely understood that communication and computational efficiency are a primary bottleneck for deploying FL in practice (Wang et al., 2019b; Rothchild et al., 2020; Chen et al., 2021; Balakrishnan et al., 2022; Wang et al., 2022). Specifically, communicating between the server and clients can be expensive and unreliable. The seminal work of Konečnỳ et al. (2016) introduces sparsification and quantization to reduce the communication cost, and several theoretical works investigate the optimal trade-off between communication cost and model accuracy (Zhang et al., 2013; Braverman et al., 2016; Han et al., 2018; Acharya et al., 2020; Barnes et al., 2020). Since practical clients usually have slower upload than download bandwidth, much research interest focuses on gradient compression (Suresh et al., 2017; Alistarh et al., 2017; Horváth et al., 2019; Basu et al., 2019). On the other hand, different methods have been proposed to reduce the computational burden of local clients (Caldas et al., 2018a; Hamer et al., 2020; He et al., 2020), since these clients are usually edge devices with limited resources. Training paradigms exploiting tensor factorization in FL can also achieve promising performance (Kim et al., 2017; Ma et al., 2019). Vulnerability in FL.
The characteristic of decentralization in FL is beneficial to protecting data privacy of clients, but in the meanwhile, providing white-box accessibility of model status leaves flexibility for malicious clients to perform poisoning/backdoor attacks (Bhagoji et al., 2019; Bagdasaryan et al., 2020; Wang et al., 2020a; Xie et al., 2020; Pang et al., 2021) , model/gradient inversion attacks (Zhang et al., 2020; Geiping et al., 2020; Huang et al., 2021) , and membership inference attacks (Shokri et al., 2017; Nasr et al., 2019; Luo et al., 2021) . To alleviate the vulnerability in FL, several defense strategies have been proposed via selecting reliable clients (Kang et al., 2020) , data augmentation (Borgnia et al., 2021) , update clipping (Sun et al., 2019) , robust training (Li et al., 2021) , model perturbation (Yang et al., 2022) , detection methods (Seetharaman et al., 2021; Dong et al., 2021) , and methods based on differential privacy (Wei et al., 2020) , just to name a few.

C TRUSTED EXECUTION ENVIRONMENTS

A trusted execution environment (TEE) (Sabt et al., 2015) is regarded as the ultimate solution for defending against white-box attacks, since it prevents any model exposure. TEE protects both data and model security with three components: physical secure storage to ensure the confidentiality, integrity, and tamper-resistance of stored data; a root of trust to load trusted code; and a separate kernel to execute code in an isolated environment, as illustrated in Figure 5. Using TEE, the FL system is able to train deep models without revealing any model specifics. However, due to the security guarantee, the usable memory of a TEE is typically small (Truong et al., 2021) (e.g., 90MB



Figure 1: A sketch map of BAFFLE. In addition to the global parameter update ∆W, each client downloads random seeds to locally generate perturbations ±δ_{1:K} and performs 2K forward propagations (i.e., inferences) to compute loss differences. The server can recover these perturbations using the same random seeds and obtain ∆L(W, δ_k) by secure aggregation. Each loss difference ∆L(W, δ_k; D_c) is a floating-point number, so K can easily be adjusted to fit the uploading bandwidth.

FEDERATED LEARNING
Suppose we have $C$ clients, where the $c$-th client's private dataset is defined as $D_c := \{(X_i^c, y_i^c)\}_{i=1}^{N_c}$ with $N_c$ input-label pairs. Let $L(W; D_c)$ represent the loss function calculated on the dataset $D_c$, where $W \in \mathbb{R}^n$ denotes the server model's global parameters. The training objective of FL is to find $W$ that minimizes the total loss function
$$L(W) := \sum_{c=1}^{C} \frac{N_c}{N} L(W; D_c), \quad \text{where } N = \sum_{c=1}^{C} N_c. \qquad (1)$$
In the conventional FL framework, clients locally compute gradients $\{\nabla_W L(W; D_c)\}_{c=1}^{C}$ or model updates through backpropagation and then upload them to the server. Federated averaging (McMahan et al., 2017) performs global aggregation using $\Delta W := \sum_{c=1}^{C} \frac{N_c}{N} \Delta W_c$, where $\Delta W_c$ is the local update obtained by executing $W_c \leftarrow W_c - \eta \nabla_{W_c} L(W_c; D_c)$ multiple times and $\eta$ is the learning rate.
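The aggregation rule above can be sketched in a few lines of code; `fedavg` is a hypothetical helper operating on plain arrays, not the paper's implementation:

```python
import numpy as np

def fedavg(updates, sizes):
    """Global aggregation: Delta_W = sum_c (N_c / N) * Delta_W_c."""
    N = sum(sizes)
    return sum((n_c / N) * dW for dW, n_c in zip(updates, sizes))

# Toy example: three clients with local updates and dataset sizes N_c.
updates = [np.ones(4), 2 * np.ones(4), 4 * np.ones(4)]
sizes = [10, 20, 10]
delta_W = fedavg(updates, sizes)
print(delta_W)  # [2.25 2.25 2.25] = (10*1 + 20*2 + 10*4) / 40
```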

2: Inputs: $C$ clients with local datasets $\{D_c\}_{c=1}^{C}$ containing $N_c$ input-label pairs, $N = \sum_{c=1}^{C} N_c$; learning rate $\eta$, training iterations $T$, perturbation number $K$, noise scale $\sigma$.
3: Se: initializing model parameters $W \leftarrow W_0$;
4: Se: encoding the computing paradigm into TEE as $\mathrm{TEE} \circ \Delta L(W, \delta; D)$; # optional
5: for $t = 0$ to $T - 1$ do
6: Se ⇒ all Cl: downloading model parameters $W_t$ and the computing paradigm;
7:
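A minimal sketch of the client/server interaction described above, with a toy quadratic loss standing in for the model; `loss_fn`, `client_round`, and `server_estimate` are illustrative names under our own assumptions, not the paper's implementation:

```python
import numpy as np

SIGMA, SEEDS = 0.01, range(1000)  # noise scale and K shared random seeds

def loss_fn(W, data):
    # Toy local objective: L(W; D) = mean ||x - W||^2 over the local data.
    return float(np.mean(np.sum((data - W) ** 2, axis=1)))

def client_round(W, data):
    """Client side: regenerate each delta_k from its seed and return the
    scalar loss difference of two forward passes (central scheme):
        dL_k = L(W + delta_k; D) - L(W - delta_k; D)."""
    diffs = []
    for seed in SEEDS:
        delta = np.random.default_rng(seed).normal(0.0, SIGMA, size=W.shape)
        diffs.append(loss_fn(W + delta, data) - loss_fn(W - delta, data))
    return diffs  # K floats uploaded to the server

def server_estimate(W_shape, diffs):
    """Server side: recover the same perturbations from the shared seeds and
    form the estimate (1/K) * sum_k delta_k / (2 sigma^2) * dL_k."""
    grad = np.zeros(W_shape)
    for seed, dL in zip(SEEDS, diffs):
        delta = np.random.default_rng(seed).normal(0.0, SIGMA, size=W_shape)
        grad += delta / (2 * SIGMA**2) * dL
    return grad / len(diffs)

W = np.zeros(3)
data = np.array([[1.0, 2.0, 3.0]])
grad = server_estimate(W.shape, client_round(W, data))
print(grad)  # close to the exact gradient -2 * (x - W) = [-2, -4, -6]
```

Note that the client only ever executes forward passes and uploads K scalars; the seeds let the server rebuild each $\delta_k$ without transmitting it.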

we obtain $\mathbb{E}[\hat{\nabla}_W L(W; D)] = \nabla_W L(W; D)$, which degrades to Stein's identity. To determine the convergence rate w.r.t. the value of $K$, we have
Theorem 2 (Adamczak et al. (2011)). With overwhelming probability, the empirical covariance matrix satisfies the inequality $\|\hat{\Sigma} - I\|_2 \leq C_0 \sqrt{n/K}$, where $\|\cdot\|_2$ denotes the operator 2-norm for matrices and $C_0$ is an absolute positive constant.
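The $\sqrt{n/K}$ rate can be illustrated numerically; the sketch below (our own, under the simplifying choice $\sigma = 1$) tracks the operator-norm error of the empirical covariance as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32  # parameter dimension, held fixed

errs = []
for K in (100, 1_000, 10_000):
    deltas = rng.normal(size=(K, n))       # sigma = 1 without loss of generality
    Sigma_hat = deltas.T @ deltas / K      # empirical covariance matrix
    errs.append(np.linalg.norm(Sigma_hat - np.eye(n), ord=2))
    print(K, errs[-1], np.sqrt(n / K))     # error shrinks roughly like sqrt(n/K)
```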

Figure 2: The classification accuracy (%) of BAFFLE in iid scenarios (C = 10) and batch-level communication settings with various K values. We treat models trained with exact gradients on conventional FL systems as the ground truth. On different datasets and architectures, BAFFLE achieves performance comparable to the exact-gradient results with a reasonable K.

Figure 3: Ablation study of our BAFFLE guidelines, with K = 100 on MNIST and K = 500 on CIFAR-10. As seen, twice-FD, Hardswish, and EMA all improve performance at the same computational cost. EMA reduces oscillations in training by lessening the effect of white noise.

Figure 4: The robustness of BAFFLE to inference attacks. For real data, we randomly sample some input-label pairs from the validation dataset. For random noise, we generate input-label pairs from a standard normal distribution. We sample 500 perturbations δ from N (0, σ² I), collect the values of ∆L(W, δ; D) for real data and random noise separately, and compare their distributions.

The classification accuracy (%) of BAFFLE in iid scenarios (C = 10) and epoch-level communication settings with different K values (annotations K1/K2 mean K1 for MNIST and K2 for CIFAR-10/100). In this configuration, each client updates its local model using BAFFLE-estimated gradients and uploads model updates to the server after an entire epoch on the local dataset. The four guidelines work well under epoch-level communication settings.

The classification accuracy (%) of BAFFLE in label non-iid scenarios (C = 100) and epoch-level communication settings with different K values. We employ a Dirichlet distribution to ensure that each client has a unique label distribution.

The Top-1 | Top-5 classification accuracy (%) of BAFFLE on OfficeHome with feature non-iid participation (C = 40) and epoch-level communication settings. We utilize a pretrained MobileNet, freeze the backbone parameters, and retrain the final classification layers.

The GPU memory cost (MB) of vanilla backpropagation and of BAFFLE. Here 'min∼max' denotes the minimum and maximum dynamic memory requirements of BAFFLE. We also report the ratio of vanilla backpropagation's memory cost to BAFFLE's maximal memory cost.

D CONVERGENCE ANALYSES OF DEEP LINEAR NETWORKS IN BAFFLE

We analyze the convergence of BAFFLE in Section 3.3 using a general technique applicable to any continuously differentiable model with loss function $L(W; D)$. Since deep networks, the most prevalent models in FL, exhibit strong linearity, it is instructive to investigate the convergence of deep linear networks (Saxe et al., 2013). Consider a two-layer deep linear network in a classification task with $L$ categories. We denote the model parameters as $\{W_1, W_2\}$, where the first layer is $W_1 \in \mathbb{R}^{n \times m}$ and the second layer $W_2 \in \mathbb{R}^{L \times n}$ consists of $L$ row vectors $\{w_2^l\}_{l=1}^{L}$ related to the $L$ categories, with $w_2^c \in \mathbb{R}^{1 \times n}$. For input data $X \in \mathbb{R}^{m \times 1}$ with label $y$, we train the deep linear network by maximizing the classification score of the $y$-th class. Since there is no non-linear activation in deep linear networks, the forward inference can be represented as $h = w_2^y W_1 X$, and the loss is $-h$. It is easy to show that $\frac{\partial h}{\partial w_2^y} = (W_1 X)^\top$ and $\frac{\partial h}{\partial W_1} = (X w_2^y)^\top$. We sample $\delta_1 \in \mathbb{R}^{n \times m}$ and $\delta_2 \in \mathbb{R}^{1 \times n}$ from the noise generator $\mathcal{N}(0, \sigma^2 I)$. Letting $h(\delta_1, \delta_2) := (w_2^y + \delta_2)(W_1 + \delta_1)X$, we find that the BAFFLE estimation in Eq. (6) follows the same pattern under both the forward (2) and central (3) schemes:
$$h(\delta_1, \delta_2) - h(-\delta_1, -\delta_2) = 2(\delta_2 W_1 + w_2^y \delta_1)X = 2\big(h(\delta_1, \delta_2) - h(0, 0)\big) - 2\delta_2 \delta_1 X, \qquad (13)$$
where the bilinear cross term $\delta_2 \delta_1 X$ has zero mean. This equivalent form in deep linear networks illustrates that the residual benefit of the central scheme is nullified by linearity, hence the two finite difference schemes perform identically on deep linear networks. We refer to this characteristic as FD scheme independence. We also find the property of σ independence: the choice of $\sigma$ does not affect the results of finite difference, because $\frac{\delta_1}{\sigma}$ and $\frac{\delta_2}{\sigma}$ follow the standard normal distribution. Based on the findings in Eq. (13), we propose the following guideline, which improves accuracy under the same computation cost: use the twice forward difference (twice-FD) scheme rather than the central scheme.
Comparing the forward scheme (2) and central scheme (3), the central scheme produces smaller residuals than the forward scheme at the cost of twice as many forward inferences, i.e., evaluating at both W + δ and W − δ. Given the same number of forward inferences (e.g., 2K), a practical question is which scheme performs better. We find that the forward scheme performs better in all our experiments, in part because linearity reduces the benefit of cancelling second-order residuals, as demonstrated by Eq. (13).
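The FD-scheme comparison for the two-layer linear model can be verified directly. The sketch below (with illustrative variable names of our own) checks that the central difference equals the first-order term exactly, while the forward difference departs from it only by the zero-mean cross term δ₂δ₁X:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 5, 4, 0.1

W1 = rng.normal(size=(n, m))   # first layer
w2 = rng.normal(size=(1, n))   # row of W2 for the true class y
X = rng.normal(size=(m, 1))    # input sample

def h(d1, d2):
    # forward pass of the perturbed two-layer deep linear network
    return float((w2 + d2) @ (W1 + d1) @ X)

d1 = rng.normal(0.0, sigma, size=(n, m))
d2 = rng.normal(0.0, sigma, size=(1, n))

first_order = float(d2 @ W1 @ X + w2 @ d1 @ X)
central = (h(d1, d2) - h(-d1, -d2)) / 2   # exactly the first-order term
forward = h(d1, d2) - h(0.0, 0.0)         # first-order term + cross term
cross = float(d2 @ d1 @ X)                # bilinear residual, zero-mean

print(abs(central - first_order))         # ~0 up to machine precision
print(abs(forward - first_order - cross)) # ~0 up to machine precision
```

Because the only discrepancy between the two schemes is the zero-mean cross term, averaging over perturbations makes them behave identically, which is the FD scheme independence noted above.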

