P2PRISM - PEER-TO-PEER LEARNING WITH INDIVIDUAL PRISM FOR MODEL AGGREGATION

Abstract

Federated learning (FL) has made collaboration between nodes possible without explicit sharing of local data. However, it requires the participating nodes to trust the server and its model updates, with the server itself being a critical node susceptible to failure and compromise. A loss of trust in the server and a demand to aggregate the model independently for oneself have led decentralized peer-to-peer learning (P2PL) to gain traction lately. In this paper, we highlight previously unexposed vulnerabilities of P2PL to malicious attacks and show how the behavior of P2PL differs from that of FL in a malicious environment. We then present a robust defense, P2PRISM, as a secure aggregation protocol for P2PL.



The server aggregates all the received gradients to update the global model. For simplicity, we assume that the entire process is synchronous: the server waits to hear from all the clients before aggregating, and all clients receive the same global model from the server after aggregation; that is, the clients always have to accept the global model sent by the server and replace their local models with it before continuing local training. Although the aggregation technique in use may be known to all, the actual aggregation is hidden from the clients for privacy reasons, as it has been shown that access to a client's gradients can be used to recover its local data, approximately by optimization methods Geiping et al. (2020) or exactly by analytical methods Fowl et al. (2021). It is therefore not possible for clients to selectively choose which other clients' gradients to aggregate, even if they would benefit from any existing spatial locality among the clients. The clients also have to trust the server to aggregate the gradients in a byzantine-robust manner. Unless the server itself possesses a root dataset Cao et al. (2021) that correctly represents, as ground truth, the entirety of the data possessed by all clients, it is difficult for the server to identify malicious updates statistically without being extremely conservative and removing any suspected gradients, leading to a significant loss of information. A node, on the other hand, does have access to its own generated gradients as a benign ground truth and can make use of them, given the power to aggregate the model for itself. For these reasons, among others, a node is motivated to lose trust in a server and join a decentralized collaboration with the other nodes.
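The synchronous FL round described above can be sketched as follows. The linear model, squared-error loss, learning rate, and synthetic data are illustrative assumptions for the sketch, not the paper's experimental setup:

```python
import numpy as np

def federated_round(global_model, client_data, lr=0.1):
    """One synchronous FL round: every client computes a gradient on the
    current global model; the server averages all received gradients."""
    gradients = []
    for X, y in client_data:
        # Local gradient of mean squared error for a linear model (illustrative).
        pred = X @ global_model
        grad = 2 * X.T @ (pred - y) / len(y)
        gradients.append(grad)
    # The server aggregates every received gradient; clients never see
    # each other's updates, so the aggregation stays hidden from them.
    avg_grad = np.mean(gradients, axis=0)
    return global_model - lr * avg_grad

# Synthetic clients whose data all come from the same linear ground truth.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ w_true))

# All clients always replace their local model with the server's global model,
# so a single vector w suffices to represent the whole federation.
w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, clients)
```

Because every client adopts the same global model after each round, one shared parameter vector models the entire system, which is exactly the full-consensus property contrasted with P2PL in the next subsection.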

1.2. COMPARISON WITH FEDERATED LEARNING

In FL, the server aggregates the gradients from all the clients. However, in P2PL, a node may choose to communicate only with its neighbors in every round and locally aggregate the received models. If the graph formed by the nodes is not fully connected with equally weighted edges, the nodes will hold models that differ from each other at every point in time, even after local aggregation. The consensus distance δ of the graph is defined as the average distance of each of the m local models x_i from their centroid x̄, known to an oracle:

δ := (1/m) Σ_{i=1}^{m} ∥x_i − x̄∥

It is measured after every aggregation step. Needless to say, it takes the value zero in FL and a non-zero value in P2PL when the graph is not fully connected. A non-zero consensus distance implies that sharing gradients as in FL is not appropriate in P2PL; the nodes need to share the actual model weights with their neighbors, because the local gradients have been computed on already differing local models. The nodes, after performing local SGD on their own data, where the nodes' data can be assumed to be non-IID with bias b to incentivize collaboration, tend to diverge from each other. In FL, the server enforces full consensus among the clients, whereas in P2PL, the nodes perform one or more rounds of gossip averaging (GA) with their neighbors to keep the consensus distance in check. It is always wise to initialize all nodes in P2PL with the same model, that is, with zero initial consensus distance, to aid consensus control in the later stages of training, as also demonstrated in the toy experiment in Figure 1. We see that with the same initialization, the nodes benefit from collaboration as k increases from 1 (individual training) to 5 and 10. This is also why every client in FL benefits from collaboration: their models are made to synchronize by the server.
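The consensus distance is straightforward to compute from the definition above; the example models below are made-up values chosen only to show the zero and non-zero cases:

```python
import numpy as np

def consensus_distance(models):
    """delta = (1/m) * sum_i ||x_i - x_bar||, where x_bar is the centroid
    of the m local models (stacked as rows), as known to an oracle."""
    x = np.asarray(models)
    centroid = x.mean(axis=0)
    return np.linalg.norm(x - centroid, axis=1).mean()

# Identically initialized nodes start at zero consensus distance.
same_init = np.ones((4, 3))
# After local SGD on non-IID data, the models drift apart (illustrative drift).
drifted = same_init + np.array([[0.1], [-0.1], [0.2], [-0.2]]) * np.ones(3)
```

Here `consensus_distance(same_init)` is 0, while `consensus_distance(drifted)` is non-zero, mirroring the FL versus P2PL distinction in the text.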
In contrast, when nodes are initialized differently, collaboration hurts them, because they are all on different trajectories towards learning an optimal model; the same ML problem may have multiple solutions depending on the initialization. In fact, we observe that the nodes are hurt even with k = m = 10, when after a single round of gossip the consensus distance falls to 0 and stays at 0. Hence, keeping the pre-gossip consensus distance low is as important as keeping the post-gossip consensus distance low, which is why we recommend setting the initial pre-gossip consensus distance to 0 by initializing all the nodes with the same model. Assuming a weighted-mean aggregation, in FL the server assigns weights to all the clients. In P2PL, every node is responsible for its own aggregation and maintains a vector of weights (zero or non-zero) that it assigns to every other node, including itself. These vectors, when stacked on top of each other, form the mixing matrix W of the graph and define its topology. If x is a matrix whose i-th row x_i is the model of the i-th node, then the gossip averaging step is captured by the matrix operation Wx. Note that a node is only affected by its direct neighbors in one round of gossip averaging, but it can be affected by an indirectly connected node if multiple rounds are performed. For example, two rounds of GA with matrix W produce the models W(Wx), which is effectively a single round of GA with the matrix W²; since the entries of W are non-negative, W² is no sparser than W. It should also be obvious that the more gossip averaging steps are performed, the tighter the consensus control Kong et al. (2021).
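The mixing-matrix view of gossip averaging can be illustrated on a small graph; the 4-node ring topology and its uniform weights below are an assumed example, not a topology used in the paper:

```python
import numpy as np

# Mixing matrix of a ring of 4 nodes: each row is the weight vector a node
# assigns to itself (0.5) and its two direct neighbours (0.25 each).
W = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])

# One scalar "model" per node, stacked as rows of x.
x = np.array([[0.0], [1.0], [2.0], [3.0]])

one_round = W @ x          # a node mixes with its direct neighbours only
two_rounds = W @ (W @ x)   # equals (W @ W) @ x: one round with the denser W^2
```

Here `W` has zero entries for non-neighbours, while `W @ W` is strictly positive, so two rounds let every node influence every other; the spread of the node models also shrinks with each round, showing the tighter consensus control.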

1.3. P2PL IN A BYZANTINE ENVIRONMENT

P2PL behaves differently from FL under a malicious setting. Assuming the same model poisoning attack in both cases, with c out of m nodes being malicious or compromised, while the FL server has to deal with a fraction f = c/m of malicious nodes, the nodes in P2PL have to deal with
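One way to make the difference concrete is to compare the global fraction f = c/m against the fraction of malicious neighbors each benign node actually faces in its own aggregation. The ring-lattice k-regular topology and the choice of which nodes are malicious below are hypothetical, chosen only for illustration:

```python
import numpy as np

def local_malicious_fractions(m, k, malicious):
    """For a ring-lattice k-regular graph (each node linked to its k nearest
    neighbours on a ring, k even), return the fraction of malicious neighbours
    each benign node faces in one aggregation round (illustrative topology)."""
    fractions = []
    for i in range(m):
        if i in malicious:
            continue
        neighbours = [(i + d) % m for d in range(-k // 2, k // 2 + 1) if d != 0]
        bad = sum(n in malicious for n in neighbours)
        fractions.append(bad / len(neighbours))
    return np.array(fractions)

m, c = 16, 2
global_fraction = c / m  # the single fraction the FL server has to deal with
# Hypothetical placement: nodes 0 and 1 are malicious and sit next to each other.
local_f = local_malicious_fractions(m, k=4, malicious={0, 1})
```

In this sketch, the nodes adjacent to the attackers face a far larger malicious fraction than f, while distant nodes face none, which is the kind of locality a P2PL defense has to account for.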



FOR PEER-TO-PEER LEARNING

FL McMahan et al. (2017); Konečnỳ et al. (2016) has demonstrated how clients can benefit from collaboration by sharing their local gradient updates with the parameter server, which in turn aggregates them Yin et al. (2018a); Blanchard et al. (2017); Guerraoui et al. (2018); Xia et al. (2019); Fung et al. (2020); Cao et al. (

Figure 1: Left: Training a k-regular peer-to-peer graph with varying values of k with same and different model initialization. We see that it is necessary to initialize nodes with the same model to expect benefit from collaboration (k > 1). Right: Demonstrating the spread of the attack in a k-regular graph where only 2 out of 128 nodes (shown in red) are malicious. We can see that the larger the collaboration (k), the greater the spread of the attack if the aggregation used is insecure, such as gossip averaging in this case.
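The spread effect in the right panel of Figure 1 can be reproduced with a toy simulation. The ring-lattice topology, uniform gossip weights, scalar models, poison magnitude, and round count below are all assumptions of this sketch, not the paper's experimental configuration:

```python
import numpy as np

def gossip_matrix(m, k):
    """Mixing matrix of a ring-lattice k-regular graph with uniform weights
    over each node's closed neighbourhood (itself plus its k neighbours)."""
    W = np.zeros((m, m))
    for i in range(m):
        for d in range(-k // 2, k // 2 + 1):
            W[i, (i + d) % m] = 1.0 / (k + 1)
    return W

def poisoned_gossip(m, k, malicious, poison=100.0, rounds=3):
    """Run a few rounds of insecure gossip averaging while the malicious
    nodes keep re-injecting a poisoned model (illustrative attack)."""
    x = np.zeros(m)
    W = gossip_matrix(m, k)
    for _ in range(rounds):
        x[list(malicious)] = poison  # attackers overwrite their local model
        x = W @ x                    # benign nodes blindly average (insecure)
    return x

m, bad = 128, {0, 1}  # 2 out of 128 nodes malicious, as in Figure 1 (right)
spread_k2 = np.sum(poisoned_gossip(m, 2, bad) > 1e-6)
spread_k8 = np.sum(poisoned_gossip(m, 8, bad) > 1e-6)
```

With larger k, the poison reaches more nodes per round, so the count of contaminated models grows with the degree of collaboration, matching the qualitative behavior shown in the figure.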

