P2PRISM - PEER-TO-PEER LEARNING WITH INDIVIDUAL PRISM FOR MODEL AGGREGATION

Abstract

Federated learning (FL) has made collaboration between nodes possible without explicit sharing of local data. However, it requires the participating nodes to trust the server and its model updates, and the server itself is a critical node susceptible to failure and compromise. A loss of trust in the server and a demand to aggregate the model independently for oneself have led decentralized peer-to-peer learning (P2PL) to gain traction lately. In this paper, we highlight previously unexposed vulnerabilities of P2PL to malicious attacks and show how P2PL and FL behave differently in a malicious environment. We then present a robust defense, P2PRISM, as a secure aggregation protocol for P2PL.

1. INTRODUCTION

1.1 MOTIVATION FOR PEER-TO-PEER LEARNING

FL McMahan et al. (2017); Konečnỳ et al. (2016) has demonstrated how clients can benefit from collaboration by sharing their local gradient updates with a parameter server, which in turn aggregates all the received gradients to update the global model Yin et al. (2018a); Blanchard et al. (2017); Guerraoui et al. (2018); Xia et al. (2019); Fung et al. (2020); Cao et al. (2020). For simplicity, we assume that the entire process is synchronous: the server waits to hear from all the clients before aggregation, and all clients receive the same global model from the server after aggregation; that is, the clients always have to accept the global model sent by the server and replace their local models with it before continuing local training. Although the aggregation technique being used may be known to all, the actual aggregation is hidden from the clients for privacy reasons, as it has been shown that access to a client's gradients can be used to recover its local data approximately or exactly, by optimization Geiping et al. (2020) or analytical Fowl et al. (2021) methods respectively. It is therefore not possible for clients to selectively choose other clients' gradients to aggregate, even when they would benefit from existing spatial locality among the clients. The clients also have to trust the server to aggregate the gradients in a byzantine-robust manner. Unless the server itself possesses a root dataset Cao et al. (2021) that correctly represents, as ground truth, the entirety of data possessed by all clients, it is difficult for it to identify malicious updates statistically without being extremely conservative and removing any suspected gradients, leading to a significant loss of information. A node, by contrast, does have access to its own generated gradients as a benign ground truth and can make use of them, given the power to aggregate the model for itself.
For these reasons, and several others, a node is motivated to withdraw its trust from a server and join a decentralized collaboration among the other nodes.

1.2. COMPARISON WITH FEDERATED LEARNING

In FL, the server aggregates the gradients from all the clients. However, in P2PL, a node may choose to communicate only with its neighbors in every round and locally aggregate the received models. If the graph formed by the nodes is not fully connected with equally weighted edges, the nodes will have models that differ from each other at every point in time, even after local aggregation. The consensus distance δ of the graph is defined as the average distance of the m local models x_i from their centroid x̄, known to an oracle:

δ := (1/m) Σ_{i=1}^{m} ∥x_i − x̄∥    (1)

It is measured after every aggregation step. Needless to say, it takes the value zero in FL and a non-zero value in P2PL when the graph is not fully connected. A non-zero consensus distance implies that sharing gradients as in FL is not appropriate in P2PL; the nodes need to share the actual model weights with their neighbors, because the local gradients have been computed on already differing local models. The nodes, after performing local SGD on their own data (which can be assumed to be non-IID with bias b to incentivize collaboration), tend to diverge from each other. In FL, the server enforces full consensus among the clients, whereas in P2PL, the nodes perform one or more rounds of gossip averaging (GA) with their neighbors to keep the consensus distance in check.

[Figure 1. Left: It is necessary to initialize nodes with the same model to expect benefit from collaboration (k > 1). Right: The spread of the attack in a k-regular graph where only 2 out of 128 nodes (shown in red) are malicious; the larger the collaboration (k), the greater the spread of the attack if the aggregation used is insecure, such as gossip averaging in this case.]
It is always wise to initialize all nodes in P2PL with the same model, that is, zero initial consensus distance, to aid consensus control in the later stages of training, as demonstrated in the toy experiment in Figure 1. We see that with the same initialization, the nodes benefit from collaboration as k increases from 1 (individual training) to 5 and 10. This is also why every client in FL benefits from collaboration: their models are made to synchronize by the server. When nodes are initialized differently, however, they hurt each other through collaboration, because they are all on different trajectories towards learning an optimal model; the same ML problem may have multiple solutions depending on the initialization. In fact, we observe that the nodes are hurt even with k = m = 10 when, after a single round of gossip, the consensus distance drops to 0 and stays there. Hence, keeping the pre-gossip consensus distance low is as important as keeping the post-gossip consensus distance low, which is why we recommend setting the initial pre-gossip consensus distance to 0 by initializing all the nodes with the same model. Assuming a weighted-mean aggregation, in FL the server assigns weights to all the clients. In P2PL, every node is responsible for its own aggregation and maintains a vector of weights (zero or non-zero) that it assigns to every other node, including itself. These vectors, stacked on top of each other, form the mixing matrix W of the graph and define its topology. If x is a matrix with x_i being the model of the i-th node, then the gossip-averaging step is captured by the matrix operation Wx. It is to be noted that a node is only affected by its direct neighbors in one round of gossip averaging, but can be affected by an indirectly connected neighbor if multiple rounds are performed.
For example, two rounds of GA with matrix W yield the models W(Wx) = W²x, which is effectively a single round of GA with the matrix W², where W² is less sparse than W in its positive entries. It should also be clear that the more gossip-averaging steps, the tighter the consensus control Kong et al. (2021).
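To make the mixing-matrix view concrete, the following is a minimal NumPy sketch (our own toy setup, not the paper's experiment) of gossip averaging on a small ring and its effect on the consensus distance δ:

```python
import numpy as np

# Consensus distance: average distance of each local model from the centroid.
def consensus_distance(x):
    xbar = x.mean(axis=0)                        # centroid known to an oracle
    return np.linalg.norm(x - xbar, axis=1).mean()

# Hypothetical 8-node ring: each node averages itself and its two neighbors.
m, d = 8, 4
W = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0                  # row-stochastic mixing weights

rng = np.random.default_rng(0)
x = rng.normal(size=(m, d))                      # one local model per row

delta0 = consensus_distance(x)                   # pre-gossip consensus distance
delta1 = consensus_distance(W @ x)               # one gossip round: W x
delta2 = consensus_distance(W @ W @ x)           # two rounds == one round with W^2
# Repeated gossip drives the nodes toward consensus (delta -> 0).
delta20 = consensus_distance(np.linalg.matrix_power(W, 20) @ x)
assert delta20 < delta0
```

Here a node is directly affected only by its two ring neighbors in one round, but after enough rounds (powers of W) the influence of every node reaches every other node.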

1.3. P2PL IN A BYZANTINE ENVIRONMENT

P2PL behaves differently from FL in a malicious setting. Assuming the same model poisoning attack in both cases, with c out of m nodes malicious or compromised, while the FL server has to deal with a fraction f (= c/m) of malicious nodes, the nodes in P2PL have to deal with a variable fraction of malicious neighbors, which could be greater than f for many nodes, depending on their distribution and connectivity. Attackers not only impact their direct neighbors: the impact spreads like a disease, either through multiple GA steps in the same round, or even with a single GA step when other nodes come in contact with the poisoned nodes in the next round before the infected nodes could recover during their local SGD. In this manner, the attacker also succeeds in magnifying the consensus distance among the nodes, which makes training even more difficult. Depending on the graph topology and the distribution of the attackers, although an attacker may not be able to impact every node as in FL, the impact on affected nodes can be higher, because some nodes are bound to have a fraction ≥ f of their neighbors malicious. Figure 1 demonstrates this phenomenon in P2PL in comparison with FL.

1.4. THREAT MODEL

We have extended the state-of-the-art model poisoning attack in FL, the SHEJWALKAR attack Shejwalkar & Houmansadr (2021), to the P2PL case. The malicious nodes collaborate among themselves to gain complete knowledge of the current as well as past states of the benign models. A perturbation unit vector is constructed along the direction in which the average benign model is moving; it is scaled up and subtracted from the past averaged model to push the nodes in the direction of gradient ascent. The scaling factor is chosen by solving an optimization problem that balances attack impact against stealth, given knowledge of the aggregation technique being used. The attack thus amplifies the distance between an infected and an uninfected node, leading them to disagreement and thereby breaking consensus among the nodes.
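Under our notation, the core perturbation step can be sketched as follows (a hedged sketch: the function name is ours, and the search for the optimal scaling factor λ that balances impact and stealth is omitted):

```python
import numpy as np

def craft_malicious_model(benign_models_t, benign_models_prev, lam):
    """Sketch of the perturbation: move the poisoned model from the past
    benign average along the negative of the benign direction of motion,
    scaled by lam (the impact-vs-stealth optimization of lam is omitted)."""
    avg_t = np.mean(benign_models_t, axis=0)
    avg_prev = np.mean(benign_models_prev, axis=0)
    direction = avg_t - avg_prev                    # where benign models are heading
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    return avg_prev - lam * unit                    # push toward gradient ascent
```

A larger λ gives a stronger push along gradient ascent but makes the poisoned model easier to flag as an outlier, which is exactly the trade-off the attacker's optimization resolves.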

1.5. OUR DEFENSE -P2PRISM

Our defense is based on the key intuition that the direction of the received models should be of utmost importance during aggregation. This is difficult in FL, as there is no way for the server to know which clients are benign and hence usable as a reference to compute the direction of the updates. Naturally, none of the aggregation techniques expected to be byzantine-robust in FL make use of this fact. In P2PL, a node can use its own local update as the reference to detect parameter-wise direction flips in the updates received from its neighbors. It can then filter out suspected updates based on a defense policy that we call PRISM before aggregating the received updates. We describe this policy in detail in the next section, and we use it to completely undo the damage done by the attack, reviving up to 100% of the benign nodes under a graybox attack and up to 88% of them in a k-regular graph in the presence of a whitebox attack. In summary, our contributions are:
• We demonstrate the effect of byzantine nodes in P2PL and propose a defense against the state-of-the-art attack by leveraging the ground-truth information available in P2PL as opposed to FL, reviving up to 100% of the nodes from the effect of the attack.
• We explore the behavior of P2PL across k-regular and power-law graph topologies and highlight the subtleties involved in training, such as model initialization, that affect consensus among the nodes. We also empirically show how P2PRISM keeps the consensus distance under control.
• We evaluate P2PRISM on the image datasets MNIST and FMNIST, and the NLP dataset Shakespeare, under graybox as well as fully whitebox adaptive attacker capabilities, to prove the effectiveness of our defense principle under the ultimate stress test, where up to 88% of the nodes in a k-regular graph and 66% of the nodes in a power-law graph were completely revived even when one fourth of the nodes were under the control of a whitebox attacker.

2. DESIGN

P2PRISM is instantiated at every node to perform secure aggregation. It provides the node with a prism that blocks the passage of malicious updates into the aggregated model. P2PRISM uses a metric called the flip-score Sharma et al. (2021) (FS) to detect malicious updates and helps a node keep a record of the reputation score of all its neighbors. The flip-score is expected to capture any large deflection from a trusted benign update. A node i with model x_{i,t} sets ∇x_{ii} = x_{i,t} − x_{i,t−1} as the trusted reference. All models it receives from its neighbors j are adjusted according to this reference: ∇x_{ij} = x_{j,t} − x_{i,t−1}. The flip-score of j as computed by i is

FS_{ij} = Σ_{k=0}^{|P|−1} (|∇x_{ij}[k]|)² × 1(sign(∇x_{ij}[k]) ≠ sign(∇x_{ii}[k]))    (2)

where |P| is the total number of parameters (weights and biases) in the model. Since a node can trust its own update ∇x_{ii} to point toward gradient descent, a high flip-score naturally captures an update that is likely pointing toward gradient ascent, with either a large number of parameters going in the opposite direction with a small magnitude, or a small number of them with a large magnitude. Cosine similarity is based on a similar intuition, but it does not capture a subtle attack that targets model parameters with zero gradients to prevent them from converging. With this, we can say that an abnormally large FS is indicative of an attack, but it is still not enough by itself without a defense policy. Trimming out a fixed number c_max of updates in every iteration and penalizing the reputation of the respective clients may work when the attackers are not in a majority. The attackers may, however, form a local majority of variable size in different regions of a P2PL graph, and complete security cannot be maintained with a constant c_max. We describe how P2PRISM handles this in the following steps.
1. Finding the cutoff FS: every node finds the minimum (FS_min) and median (FS_med) flip-scores over all the updates it has received in a given iteration. It then sets a cutoff at FS_med + µ(FS_med − FS_min), where µ ≥ 0 is our only defense parameter. A lower µ sets a conservative policy, whereas a larger value sets a lenient one. For a robust computation, it is advisable to use a low percentile instead of the absolute minimum when the number of neighbors is sufficiently large (≥ 100, for example).
2. Reward or penalize neighbors: all neighbors with FS below the cutoff are given a constant reward of 1 unit, which prevents any node from accumulating high rewards compared to others. All others are penalized by the variable amount min((FS − FS_med)/(FS_med − FS_min), 5), so that those close to but above the median are penalized less than those far above it. The penalty is upper capped so that a node is allowed to redeem itself if it acts benign in the future.
3. Update reputation: the reward or penalty is added to the previous reputation score, initialized to 1/k for a neighbor of a node that has k − 1 neighbors. In this way, we make use of the past behavior of every neighbor. The reputation scores are normalized by dividing each value by the total positive reputation score of all neighboring nodes; excluding negative weights from the normalization prevents the denominator from getting too small. All neighbors with a negative reputation are filtered out by the defense policy.
4. Aggregation: the remaining updates, including a node's self-update, undergo mean aggregation. A weighted mean, weighted by reputation, is intentionally avoided to prevent the effect of any existing bias. For example, nodes with similar updates could assign each other a high weight, isolating themselves from the rest of the graph and failing to benefit from collaboration to learn a generalized model.
This would also result in a higher consensus distance among the nodes. In order to protect a node surrounded by a malicious majority, we also cap its FS_med to be ≤ 10 × FS_min; without this, the median value could be poisoned, handing victory to the attackers. The pseudocode of P2PRISM is described in Algorithm 1; P2PRISM is instantiated at every node in the peer-to-peer graph. It is to be noted that we use FS only to trim out anomalies, which affects the mixing-matrix weights in every round. Hence the convergence analysis follows from existing work Assran et al. (2019) that proves convergence for changing topologies under a column-stochasticity condition. We do not assume a doubly stochastic mixing matrix as in Koloskova et al. (2020), nor a column-stochastic one as in Assran et al. (2019), but for ideal security we want all entries in the column of a malicious node to be 0. Hence, we only assume the matrix to be row stochastic. The remaining assumptions are standard, as follows.

Algorithm 1: Peer-to-peer learning with P2PRISM
Input: node i, current model x_i(t, ·), neighbor updates x_j(t+1, ·), neighbors' reputation W_R(t, j)
Output: aggregated model x_i(t+1, ·)
Parameters: µ
0: Compute the reference vector ∇x_ii = x_i(t+1, ·) − x_i(t, ·)
1: For every neighbor j, compute the flip-score FS_ij = Σ_{k=0}^{|P|−1} (|∇x_ij[k]|)² × 1(sign(∇x_ij[k]) ≠ sign(∇x_ii[k]))
2: Compute the cutoff flip-score: FS_{min,i} = min_j {FS_ij}; FS_{med,i} = min(median_j {FS_ij}, 10 × FS_{min,i}); FS_{i,cut} = FS_{med,i} + µ × (FS_{med,i} − FS_{min,i})
3: Penalize neighbors with high FS: W_R(t+1, j) = W_R(t, j) − min{(FS_ij − FS_{med,i})/(FS_{med,i} − FS_{min,i}), 5}
4: Reward the rest of the neighbors: W_R(t+1, j) = W_R(t, j) + 1
5: Normalize the reputation weights: W_R(t+1, j) = W_R(t+1, j) / Σ_{j: W_R(t+1,j)>0} W_R(t+1, j)
6: Aggregate: x_i(t+1, ·) = x_i(t, ·) + avg_{j: W_R(t+1,j)>0} (x_j(t+1, ·) − x_i(t, ·))

Assumption #1 (L-smoothness): For models x, y ∈ R^d and local gradients ∇f_i(·) of node i, there exists a constant L > 0 such that ∥∇f_i(y) − ∇f_i(x)∥ ≤ L ∥y − x∥.
Assumption #2 (Bounded variance): Let ξ_t^j be sampled uniformly from the local data D_j of the j-th node. The variance of the stochastic gradients of each client is bounded, E_{ξ∼D_i} ∥∇F_i(x; ξ) − ∇f_i(x)∥² ≤ σ², ∀i, ∀x, and the variance across nodes is also bounded, (1/m) Σ_{i=1}^{m} ∥∇f_i(x) − ∇f(x)∥² ≤ ζ², ∀x.
Assumption #3 (Mixing connectivity): To each mixing matrix P^(k) at iteration k we can associate a graph with vertex set {1, …, m} and edge set E^(k) = {(i, j) : w_{i,j}^(k) > 0}. We assume that there exist finite, positive integers B and ∆ such that the graph with edge set ∪_{k=lB}^{(l+1)B−1} E^(k) is strongly connected and has diameter at most ∆ for every l ≥ 0.

We show that for K greater than a finite limit, x̄^(k) = (1/m) Σ_{i=1}^{m} x_i^(k) converges, that is, (1/K) Σ_{k=1}^{K} E ∥∇f(x̄^(k))∥² ≤ ε.
We prove in Theorem 1 in Appendix §A.1 that, under the above assumptions and the definitions above and those made in §A.1, for

K ≥ max{ n L_1^4 C^4 (60)² / (1−q)⁴,  (n L_1^4 C^4 P_1 / (1−q)²) × ((f((1/n) Σ_l w_l^(0) x_l^(0)) − f*) / √(nK))²,  n P_2 × (f((1/n) Σ_l w_l^(0) x_l^(0)) − f*) / √(nK) }

the weighted mean of the local models converges in time at most K, that is,

(1/4K) Σ_{k=0}^{K−1} E ∥∇f((1/n) Σ_{l=1}^{n} w_l^(k) x_l^(k))∥² ≤ 3 (f((1/n) Σ_l w_l^(0) x_l^(0)) − f*) / √(nK)    (4)
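The per-node aggregation round of Algorithm 1 can be sketched in a few lines. The following is a minimal, self-contained NumPy sketch under our notation (the function names are ours; the reputation-normalization step is omitted, since the final aggregation here is an unweighted mean over the surviving neighbors):

```python
import numpy as np

def flip_score(ref_update, recv_update):
    """FS (Eq. 2): sum of squared magnitudes of the parameters whose sign
    flips relative to the trusted self-update."""
    flipped = np.sign(recv_update) != np.sign(ref_update)
    return float(np.sum(np.abs(recv_update) ** 2 * flipped))

def p2prism_step(x_self_prev, x_self_new, neighbor_models, reputation, mu=0.75):
    """One simplified aggregation round of the defense.
    reputation maps neighbor id -> score; returns (new model, reputation)."""
    ref = x_self_new - x_self_prev                       # trusted reference update
    updates = {j: xj - x_self_prev for j, xj in neighbor_models.items()}
    fs = {j: flip_score(ref, u) for j, u in updates.items()}

    fs_min = min(fs.values())
    fs_med = min(float(np.median(list(fs.values()))), 10 * fs_min)  # anti-poisoning cap
    fs_cut = fs_med + mu * (fs_med - fs_min)

    for j, s in fs.items():
        if s <= fs_cut:
            reputation[j] += 1                           # constant reward
        else:                                            # distance-based, capped penalty
            reputation[j] -= min((s - fs_med) / (fs_med - fs_min + 1e-12), 5)

    survivors = [u for j, u in updates.items() if reputation[j] > 0]
    # Unweighted mean over the surviving updates plus the self-update.
    agg = np.mean(survivors + [ref], axis=0)
    return x_self_prev + agg, reputation
```

For instance, a neighbor sending −10× the trusted update receives a large flip-score, takes the capped penalty of 5, drops to a negative reputation, and is excluded from the mean.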

3. IMPLEMENTATION

We have performed our experiments on two graph topologies: k-regular and power-law graphs. Every node in the k-regular graph was made to communicate with k − 1 other neighbors in order to perform a k-gossip averaging in every round. In order to make the in-degree (the number of nodes from which updates are received, including the self node) equal to the out-degree (the number of neighbors to which one's updates are sent), the nodes were simulated to be situated around a ring, and symmetric communication was established between a node and the ⌈(k−1)/2⌉ nodes ahead of it and the ⌊(k−1)/2⌋ nodes behind it. In a power-law graph, some nodes are popular with many connections, while most have only a few; one node has in-degree m, so that the in-degree distribution adheres to the power law and all nodes are assigned some in-degree from the available degree values. c out of m nodes were randomly chosen to be malicious, with their models crafted with the SHEJWALKAR attack with unit-vector perturbation in every iteration. The source code was obtained from https://github.com/vrt1shjwlkr/NDSS21-Model-Poisoning and the same attack parameter (λ₀ = 10.0) was used. The gradient-descent direction was estimated by accessing the current and past averages of the models from the benign nodes. The attack assumes a mean-like aggregation when optimizing its λ value. Every node was made to perform one iteration of local training (l = 1) on its minibatch, after which one round of gossip was performed. We use the image datasets MNIST and Fashion-MNIST Xiao et al. (2017), and the NLP dataset Shakespeare Caldas et al. (2019), for our experiments, trained with a DNN and an LSTM respectively. We report the test accuracy for the image datasets and the test perplexity (= 2^{test loss}) for the NLP dataset, as is standard practice. The training hyperparameters are described in Table 1.
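The ring placement described above can be sketched as follows (the function name is ours):

```python
import math

def k_regular_ring_neighbors(i, m, k):
    """Out-neighbors of node i on a ring of m nodes: ceil((k-1)/2) nodes
    ahead of it and floor((k-1)/2) behind it, so that every node's
    in-degree equals its out-degree (k - 1, excluding the self node)."""
    ahead = [(i + s) % m for s in range(1, math.ceil((k - 1) / 2) + 1)]
    behind = [(i - s) % m for s in range(1, math.floor((k - 1) / 2) + 1)]
    return ahead + behind
```

For k = 4 on a ring of 8 nodes, node 0 sends to nodes 1 and 2 ahead and node 7 behind, and by symmetry also receives exactly three updates, giving equal in- and out-degrees.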
The DNN used to train MNIST and FMNIST consists of two convolutional layers with 30 and 50 channels respectively and 3×3 kernels, separated by a 2×2 max-pool layer and followed by two fully connected layers of 200 and 10 neurons. The LSTM used to train Shakespeare consists of 1 hidden layer with 128 neurons between the input and output layers. For the defense, a default value of µ = 0.75 was used against the graybox attack and µ = 0 against the whitebox attack. Due to the small value of k used in the simulations, we use the 0-th percentile, that is, the minimum flip-score value, to set the upper bound on the permissible flip-score. For comparison, the benign baseline of gossip averaging was used. In the malicious case, the baseline used was Trimmed Mean (TM) aggregation with c_{max,i} = ⌈c·k_i/m⌉ for every node i, where k_i is its number of neighbors. For every model weight and bias, this trims c_{max,i} of the received gradient values from both the higher and the lower end of the k_i values received, based on their magnitude. The gradients here refer to the difference between the current model received from a neighbor and one's own model from the past iteration. By default, we choose m = 128, and the data was distributed in a non-IID manner with a bias b of 0.5 for the image datasets, both of which have 10 classes. The nodes were divided into 10 groups, and a data sample with label l was assigned to group l with probability b and to any other group with probability (1−b)/9. Within a group, the data samples were distributed randomly among the clients. For the NLP dataset, the data was divided sequentially (which is also non-IID) into m + 1 chunks, one for each client and the last chunk for testing.
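The biased non-IID split for the image datasets can be sketched as follows (a minimal sketch; the function name and seed are ours):

```python
import numpy as np

def biased_group_assignment(labels, num_groups=10, b=0.5, seed=0):
    """Assign each sample with label l to group l with probability b,
    and to a uniformly random other group otherwise, i.e. each other
    group with probability (1 - b) / (num_groups - 1)."""
    rng = np.random.default_rng(seed)
    groups = np.empty(len(labels), dtype=int)
    for idx, l in enumerate(labels):
        if rng.random() < b:
            groups[idx] = l                      # keep the sample in its own group
        else:
            others = [g for g in range(num_groups) if g != l]
            groups[idx] = rng.choice(others)     # spread the rest uniformly
    return groups
```

Within each group, the samples would then be shuffled uniformly among that group's clients, as described above.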

4. EVALUATION

We first demonstrate how P2PRISM stops the spread of a graybox attack, one that only knows that a mean-like aggregation is being used but is oblivious to the details of the defense, and helps the benign nodes recover. Then we show how P2PRISM keeps the consensus distance of the graph under control, which is essential for decentralized peer-to-peer learning. We then proceed to stress-test P2PRISM under a whitebox attack, where the attacker also has access to the defense algorithm, its parameters, and the cutoff flip-score of the past iteration beyond which all models were penalized. We find that the flip-scores generated by malicious clients in the graybox attack are easily distinguishable from the benign ones, since they are far larger than the benign values, and the detection rate is not affected by the defense parameter µ at all. Therefore, we use the whitebox attack, where the malicious and benign flip-scores are indistinguishable, to show the effect of varying µ. Figure 2 shows the candle plot of test accuracy under different conditions. The lower and upper ends of a candle correspond to the 25th and 75th percentile of final test accuracy (or test perplexity for the Shakespeare dataset) among the nodes, and the ends of the candle wicks correspond to the lowest and highest final test accuracy at the end of training. The attack completely damages P2PL without any secure aggregation, down to a random test accuracy of 10%. This is also evident from the level of impact that just 2 malicious nodes had in Figure 1. We do not show that in the plots here. We see that Trimmed Mean fails against the attack for two reasons: 1) magnitude-based filtering is inefficient against a gradient-ascent attack, and 2) a fixed value of c_max, although suitable for FL, does not work for P2PL, where the fraction of malicious nodes differs in every neighborhood in any practical case.
We also observe a clear separation between the performance of Trimmed Mean and P2PRISM: P2PRISM is very close to the benign standard of gossip averaging, and in some cases even completely overlaps with it. A small dip in test accuracy can be explained by the fact that the malicious nodes attacked in every iteration, so nothing was learned from their local data, which did contribute in the completely benign case. It is important to observe that P2PRISM excels for both graph types, k-regular and power-law, across all 3 datasets used.

[Figure 4: Performance of P2PRISM against a whitebox attacker that crafts a stealthy attack with full knowledge of the defense and its parameters to bypass the prism used in the defense. We consider the attack successful if it drops the test accuracy on MNIST below 0.8 or on FMNIST below 0.65, or if it raises the test perplexity on Shakespeare above 8. With c = 16, P2PRISM was successful in preventing all benign nodes from being infected. With c = 32, P2PRISM could restrict the number of infected nodes below 6% for a k-regular graph and under 34% for a power-law graph.]

Next, we show how P2PRISM keeps the consensus among the benign nodes under control even in the presence of the attackers. In P2PL, every node can have a different model state, but as long as the models are close to each other, they are all moving along the same trajectory towards the same global solution. We set the nodes to be at consensus initially. Naturally, as they train, they drift from each other unless the graph is fully connected. The attackers try to spread dissensus among the benign nodes by infecting their neighbors. After a certain point, even if the attack is stopped, the existing dissensus may inhibit training as the nodes with disagreeing models collaborate.
P2PRISM shows, as in Figure 3, how such a dissensus scenario is prevented by identifying and filtering the malicious models based on one's flip-score. Having shown the success of P2PRISM against a graybox attack, where the attacker was not aware of the details of the defense being used by the nodes, we present a whitebox attacker at time t that has access not only to the defense parameter µ but also to the minimum, median, and cutoff flip-scores of all the clients at t − 1. The attacker uses this information to craft the attack at time t by optimizing the attack parameter λ so that the expected flip-score at t will be less than the cutoff flip-score at time t − 1 for the node being targeted. This is a highly personalized attack; the information from the past, that is, from t − 1, is used because the median, minimum, and cutoff values at t are determined by a node only after receiving all the updates, while the attacker has yet to craft its attack. Given that every node runs just one iteration of local training and the training trajectory is smooth, the estimation done by the attacker from one timestep in the past is extremely accurate. The malicious and benign flip-scores are now indistinguishable, and the defense is bypassed. However, we observe that this stealthy attack loses its damaging impact as a trade-off most of the time. We see in Figure 4 that P2PRISM could save 100% of the nodes from being infected when c/m = 16/128 = 1/8. However, on increasing the number of attackers to 32, positioned at random locations in the graph, the attackers do form a majority in some neighborhoods and are able to attack the benign nodes at such vulnerable locations. This number is very small, though: as can be seen in Figure 4, the attacker could infect less than 6% of the benign nodes in a k-regular graph and less than 34% in a power-law graph.
Although the power-law graph appears infected, this is only because the attack, being whitebox, stealthy, and controlling one fourth of the popular nodes, becomes too powerful. This shows the limit of malice up to which P2PRISM can be expected to provide perfect security; beyond it, saving all nodes from being infected cannot be guaranteed. µ was set to zero to keep the defense conservative against such a powerful attacker. We take this opportunity to also demonstrate the effect of sweeping µ. With a higher µ, the flip-score cutoff increases, thereby giving the attackers the opportunity to be stealthy even with a higher attack magnitude λ. Thus, the stealthy attack may still be damaging, as the detection threshold is no longer as sensitive. We see the same trend in Figure 5, where the whitebox attackers are able to impact more and more nodes as µ increases; however, this increase is very gradual, which reflects the robustness of P2PRISM. This also empirically justifies our default value of µ = 0.75 for all experiments with the graybox attacker. It is always recommended to use a low µ for a conservative defense; however, this experiment proves that the performance of P2PRISM is not too sensitive to its parameter µ across a large usable range.

[Figure 5: The effect of sweeping the defense parameter µ against a whitebox attack on MNIST and FMNIST in a k-regular graph with m = 128, c = 32. With increasing µ, the cutoff flip-score increases for every node, and a relatively stronger attack gets the opportunity to be stealthy and bypass P2PRISM. Hence, we see a gradual increase in the percentage of infected nodes. However, even with a high value of µ = 4, this increase is very slow and the fraction of infected nodes stays below 12%, proving the robustness of P2PRISM, that is, low sensitivity across a large sweep of this parameter.]

5. CONCLUSION

In this paper, we have presented a secure aggregation protocol for nodes participating in collaborative learning as each other's peers. We began by introducing the concept of peer-to-peer learning in comparison with federated learning. We saw how the consensus distance at model initialization can significantly affect the training process and stressed the importance of initializing all the participating nodes with the same model. This is an extremely important aspect of the training process that has been ignored in the context of decentralized collaborative learning. We also discussed how P2PL is affected by byzantine nodes differently from FL, depending on the graph connectivity, the number of gossiping steps per round, and the location of the malicious nodes. Unlike in FL, an attacker can have full access to its neighbors' models via gossiping and can launch a personalized attack. At the same time, every node has access to its own trusted benign gradient updates to compare all incoming updates against, and unlike in FL, a node can perform a trusted secure aggregation for itself, given a robust defense policy. We leveraged this fact to construct our own defense policy, P2PRISM, based on the concept of the flip-score, which can protect a node even if it is surrounded by a malicious majority in its neighborhood. We evaluated P2PRISM against the state-of-the-art model poisoning attack and compared it against the secure aggregation protocol of Trimmed Mean, extended to the decentralized learning domain. P2PRISM could successfully recover all benign models under a graybox attacker, while also maintaining a low consensus distance among the benign nodes across all three datasets (MNIST, FMNIST, and Shakespeare) on both the k-regular and power-law graph topologies.
This was followed by a stress test of P2PRISM against a whitebox attacker that, in addition to having access to all the benign models, also has access to the defense parameter and can launch a highly personalized attack on each of its neighbors. We saw that P2PRISM could recover all the benign nodes even in the presence of this extremely powerful attacker when only one eighth of the nodes were malicious. However, when the fraction of malicious nodes goes up to one fourth, some benign nodes begin to be infected, but their fraction is contained below 6% for a k-regular graph. With a power-law graph in the presence of such a powerful adversary controlling one fourth of all the nodes, P2PRISM could only save 66% of the nodes from being infected. We also swept the defense parameter in the presence of this whitebox attacker to show the robustness of P2PRISM across a large usable range of the parameter; P2PRISM successfully contained the infection within 12% of the benign nodes. With this, we conclude the evaluation of P2PRISM and propose its usage at every node in a decentralized learning architecture for the best possible security against model poisoning attacks.

A APPENDIX

A.1 CONVERGENCE ANALYSIS

Theorem 1: Here, we show that, under P2PRISM, the weighted mean of the local models converges; specifically,

$$ \frac{1}{4K} \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l=1}^{n} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \le \frac{3\left( f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^* \right)}{\sqrt{nK}} \quad (5) $$

where

$$ K \ge \max\left\{ \frac{n L_1^4 C^4 (60)^2}{(1-q)^4},\; \frac{n L_1^4 C^4 P_1}{(1-q)^2} \left( \frac{f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^*}{\sqrt{nK}} \right)^2,\; n P_2 \left( \frac{f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^*}{\sqrt{nK}} \right) \right\} \quad (6) $$

For Assumptions 1-3 to be valid, for the sake of analysis we assume all $n$ nodes act benign within bounded variance, and we show that the standard decentralized average $\bar{x}^{(k)} = \frac{1}{n}\sum_{i=1}^{n} x_i^{(k)}$ achieves convergence under mixing matrices generated by P2PRISM. We modify the convergence analysis in Assran et al. (2019) by replacing a column-stochastic mixing matrix with a row-stochastic one, under the protocols proposed in P2PRISM. We describe the complete proof in detail below.

Theorem 2: Under the same assumptions as in Theorem 1,

$$ \frac{1}{nK} \sum_{k=0}^{K-1} \sum_{i=1}^{n} \mathbb{E}\left\| \frac{1}{n}\sum_{l=1}^{n} w_l^{(k)} x_l^{(k)} - x_i^{(k)} \right\|^2 \le O\left( \frac{1}{K} + \frac{1}{K^{3/2}} \right), \qquad \frac{1}{K} \sum_{k=0}^{K-1} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left\| \nabla f\left( x_i^{(k)} \right) \right\|^2 \le O\left( \frac{1}{\sqrt{nK}} + \frac{1}{K} + \frac{1}{K^{3/2}} \right) \quad (7) $$

That is, the parameters at each node converge as well.

Lemma 3: Under Assumptions 1-3, let $\lambda = 1 - nD^{-\Delta B}$ and $q = \lambda^{\frac{1}{\Delta B + 1}}$. Then there exists a constant

$$ C < 2\sqrt{d}\, D^{\Delta B}\, \lambda^{-\frac{\Delta B + 2}{\Delta B + 1}} \left( 1 + \frac{n^{3/2}}{2} \right) $$

where $d$ is the dimension of $x_i^{(k)}$ for all $i = 1, 2, \ldots, n$ and $k \ge 0$, and $n$ is the number of nodes. The communication topology is $B$-strongly connected: to each mixing matrix $P^{(k)}$ we associate a graph with vertex set $\{1, \ldots, n\}$ and edge set $E^{(k)} = \{(i, j) : w_{i,j}^{(k)} > 0\}$, and $B$-strong connectivity means there exist finite, positive integers $B$ and $\Delta$ such that the graph with edge set $\bigcup_{k=lB}^{(l+1)B-1} E^{(k)}$ is strongly connected and has diameter at most $\Delta$ for every $l \ge 0$. Then

$$ \mathbb{E}\left\| \frac{1}{n}\sum_{l=1}^{n} w_l^{(k)} x_l^{(k)} - x_i^{(k)} \right\|^2 \le C q^k \left\| x_i^{(0)} \right\|^2 + \gamma C \sum_{s=0}^{k} q^{k-s} \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) \right\|^2 \quad (8) $$

Proof: We make modifications in Lemma 3 of Assran et al.
(2019) to prove our inequality. Note that the former analysis assumes column-stochastic mixing matrices, which have the property that premultiplication by a row of ones ($\mathbf{1}_n^T$) returns $\mathbf{1}_n^T$. In our case, the same operation instead yields a vector $w$, each element of which is the column sum of the mixing matrix. We prove in the following lemmas how this modification affects the convergence analysis.

Lemma 3 in Assran et al. (2019) states: suppose that Assumption 3 (mixing connectivity) holds. Let $\lambda = 1 - nD^{-(\tau+1)\Delta B}$ and $q = \lambda^{1/((\tau+1)\Delta B + 1)}$, where $\tau$ is the delay in asynchronous communication among the nodes. Then there exists a constant $C < 2\sqrt{d}\, D^{(\tau+1)\Delta B}\, \lambda^{-\frac{(\tau+1)\Delta B + 2}{(\tau+1)\Delta B + 1}}$, where $d$ is the dimension of $x^{(k)}$, $z_i^{(k)}$, and $x_i^{(0)}$, such that, for all $i = 1, 2, \ldots, n$ (non-virtual nodes) and $k \ge 0$,

$$ \left\| \bar{x}^{(k)} - z_i^{(k)} \right\|^2 \le C q^k \left\| x_i^{(0)} \right\|^2 + \gamma C \sum_{s=0}^{k} q^{k-s} \left\| \nabla F_i\left( z_i^{(s)}; \xi_i^{(s)} \right) \right\|^2 \quad (9) $$

where $z$ is the unbiased estimate of $x$ in their problem formulation. This lemma itself follows from a small adaptation of Theorem 1 in Assran & Rabbat (2020), proven in Assran (2018). We show the modifications we make to the former proof to achieve our result. It is shown in Assran (2018) that

$$ \left\| z_i^{(k+1)} - \frac{\mathbf{1}^\top x^{(k)}}{n} \mathbf{1} \right\| \le \frac{C}{\delta_{\min}} q^k \left\| x^{(0)} \right\|_1 + \frac{C}{\delta_{\min}} \sum_{s=1}^{k} q^{k-s} \left\| \eta^{(s)} \right\|_1 + \left\| \frac{\mathbf{1}^\top x^{(k)}}{n} \mathbf{1} \right\| q^k \quad (10) $$

where $\eta$ is the local update made by a node after gossip averaging, as in the recursion

$$ x^{(1)} = P^{(0)} x^{(0)} + \eta^{(1)}, \qquad x^{(2)} = P^{(1)} x^{(1)} + \eta^{(2)} = P^{(1)} P^{(0)} x^{(0)} + P^{(1)} \eta^{(1)} + \eta^{(2)} $$

In general,

$$ x^{(k)} = \prod_{i=0}^{k-1} P^{(i)} x^{(0)} + \sum_{i=1}^{k-1} \left( \prod_{j=i}^{k-1} P^{(j)} \right) \eta^{(i)} + \eta^{(k)} \quad (11) $$

If $P$ is column stochastic, then $\mathbf{1}^T P = \mathbf{1}^T$. This fact is used to simplify the above equation by premultiplying the LHS and RHS by $\mathbf{1}^T$, to obtain $\mathbf{1}^T x^{(k)} = \mathbf{1}^T x^{(0)} + \sum_{i=1}^{k} \mathbf{1}^T \eta^{(i)}$; plugged into (10), this leads to (9) with $C = \frac{2C}{\delta_{\min}}$. We modify the above analysis by replacing $\mathbf{1}^T$ with $w^{(k)}$, where $\sum_{i=1}^{n} w_i^{(k)} = n$.
Premultiplying (11) by $\mathbf{1}^T$ gives $\mathbf{1}^T x^{(k)} = \mathbf{1}^T x^{(0)} + \sum_{i=1}^{k} \mathbf{1}^T \eta^{(i)}$, which leads to the inequality $\left\| \mathbf{1}^T x^{(k)} \right\| \le n \left\| x^{(0)} \right\| + n \sum_{s=1}^{k} \left\| \eta^{(s)} \right\|$ in the column-stochastic case. Premultiplying (11) by $w^{(k)}$ instead results in

$$ w^{(k)} x^{(k)} = w^{(k)} \prod_{i=0}^{k-1} P^{(i)} x^{(0)} + w^{(k)} \sum_{i=1}^{k-1} \left( \prod_{j=i}^{k-1} P^{(j)} \right) \eta^{(i)} + w^{(k)} \eta^{(k)} $$

We now use the fact that the product of two row-stochastic matrices is also row-stochastic, and represent every product of matrices above by a single matrix, simplifying to

$$ w^{(k)} x^{(k)} = w^{(k)} P_0 x^{(0)} + w^{(k)} \sum_{i=1}^{k-1} P_i \eta^{(i)} + w^{(k)} \eta^{(k)} $$

Absorbing the rightmost term into the summation by taking $P_k$ to be the identity matrix (which is also row stochastic) and applying the triangle inequality, we obtain

$$ \left\| w^{(k)} x^{(k)} \right\| \le \left\| w^{(k)} P_0 x^{(0)} \right\| + \sum_{i=1}^{k} \left\| w^{(k)} P_i \eta^{(i)} \right\| $$

Upper bound on elements of $P$: In our construction, the mixing matrix $P$ is row stochastic, and every node gives equal weight to all of its neighbors that pass its PRISM. Assuming every node chooses to aggregate the model of at least one of its neighbors in every iteration, allotting a weight of $\frac{1}{2}$ to itself and the neighbor, the upper limit on the column sum $w_i$ is $\frac{n}{2}$ for a node $i$; this extreme is reached when every node $j$ chooses to aggregate only two models, namely $x_i$ and $x_j$, and allots a weight of $0$ to all other nodes. Thus, if we consider the vector $w' = wP$, where $P$ is any row-stochastic matrix and $w$ is the column-sum vector of any row-stochastic matrix, then the $i$-th element satisfies $w'_i = \sum_j w_j c_{i,j} \le \frac{n}{2}$, where $c_i$ is the $i$-th column of $P$. Hence, for all such $w$ and $P$, we have $\| wP \|_2^2 \le n \left( \frac{n}{2} \right)^2$, that is, $\| wP \|_2 \le \frac{n^{3/2}}{2}$. We obtain

$$ \frac{\left\| w^{(k)} x^{(k)} \right\|}{n} \le \frac{n^{3/2}}{2} \left\| x^{(0)} \right\| + \frac{n^{3/2}}{2} \sum_{s=1}^{k} \left\| \eta^{(s)} \right\| $$

Plugging this into (10), we obtain (9) back with $C = \frac{C}{\delta_{\min}} \left( 1 + \frac{n^{3/2}}{2} \right)$. Therefore, the lemma described in Assran et al. (2019) for column-stochastic matrices remains valid for a row-stochastic mixing matrix as well, but with a different constant $C$. Now we substitute $\tau = 0$, since our setup is synchronous, to present the upper bound on our constant:

$$ C < 2\sqrt{d}\, D^{\Delta B}\, \lambda^{-\frac{\Delta B + 2}{\Delta B + 1}} \left( 1 + \frac{n^{3/2}}{2} \right) $$

where $\lambda = 1 - nD^{-\Delta B}$ and $q = \lambda^{\frac{1}{\Delta B + 1}}$. Since in our setup $z_i = x_i$, we substitute it back to complete our proof.

Under review as a conference paper at ICLR 2023

Lemma 4 (bound on the stochastic gradient): We have the following inequality under Assumptions 1 and 2:

$$ \mathbb{E}\left\| \nabla f_i\left( x_i^{(k)} \right) \right\|^2 \le 3L^2 \mathbb{E}\left\| x_i^{(k)} - \frac{1}{n}\sum_{l=1}^{n} w_l^{(k)} x_l^{(k)} \right\|^2 + 3\zeta^2 + 3\mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l=1}^{n} w_l^{(k)} x_l^{(k)} \right) \right\|^2 $$

Proof:

$$ \begin{aligned} \mathbb{E}\left\| \nabla f_i\left( x_i^{(k)} \right) \right\|^2 &\overset{\text{Cauchy-Schwarz}}{\le} 3\mathbb{E}\left\| \nabla f_i\left( x_i^{(k)} \right) - \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + 3\mathbb{E}\left\| \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) - \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + 3\mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \\ &\overset{L\text{-smooth}}{\le} 3L^2 \mathbb{E}\left\| x_i^{(k)} - \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right\|^2 + 3\mathbb{E}\left\| \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) - \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + 3\mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \\ &\overset{\text{bounded variance}}{\le} 3L^2 \mathbb{E}\left\| x_i^{(k)} - \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right\|^2 + 3\zeta^2 + 3\mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \end{aligned} $$

Lemma 5: Under Assumptions 1-3, we have

$$ Q_i^{(k)} = \mathbb{E}\left\| \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} - x_i^{(k)} \right\|^2 \le \left( \gamma^2 \frac{4C^2}{(1-q)^2} + \gamma q^k \frac{C^2}{1-q} \right) \sigma^2 + \left( \gamma^2 \frac{12C^2}{(1-q)^2} + \gamma q^k \frac{3C^2}{1-q} \right) \zeta^2 + \left( \gamma^2 \frac{12L^2C^2}{1-q} + \gamma q^k 3L^2C^2 \right) \sum_{j=0}^{k} q^{k-j} Q_i^{(j)} + \left( \gamma^2 \frac{12C^2}{1-q} + \gamma q^k 3C^2 \right) \sum_{j=0}^{k} q^{k-j} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(j)} x_l^{(j)} \right) \right\|^2 + \left( q^{2k} C^2 + \gamma q^k \frac{2C^2}{1-q} \right) \left\| x_i^{(0)} \right\|^2 $$

Proof: By Lemma 3,

$$ Q_i^{(k)} \le \mathbb{E}\left( C q^k \left\| x_i^{(0)} \right\| + \gamma C \sum_{s=0}^{k} q^{k-s} \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) \right\| \right)^2 $$

Writing $\nabla F_i = \left( \nabla F_i - \nabla f_i \right) + \nabla f_i$ and defining

$$ a = C q^k \left\| x_i^{(0)} \right\|, \quad b = \gamma C \sum_{s=0}^{k} q^{k-s} \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) - \nabla f_i\left( x_i^{(s)} \right) \right\|, \quad c = \gamma C \sum_{s=0}^{k} q^{k-s} \left\| \nabla f_i\left( x_i^{(s)} \right) \right\| $$

we have $Q_i^{(k)} \le \mathbb{E}\left[ a^2 + b^2 + c^2 + 2ab + 2bc + 2ac \right]$.
Let us now obtain bounds for all of these quantities:

$$ a^2 = C^2 \left\| x_i^{(0)} \right\|^2 q^{2k} $$

$$ b^2 = \gamma^2 C^2 \sum_{j=0}^{k} q^{2(k-j)} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + \underbrace{2\gamma^2 C^2 \sum_{j=0}^{k} \sum_{s=j+1}^{k} q^{2k-j-s} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\| \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) - \nabla f_i\left( x_i^{(s)} \right) \right\|}_{b_1} $$

$$ c^2 = \gamma^2 C^2 \sum_{j=0}^{k} q^{2(k-j)} \left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + \underbrace{2\gamma^2 C^2 \sum_{j=0}^{k} \sum_{s=j+1}^{k} q^{2k-j-s} \left\| \nabla f_i\left( x_i^{(j)} \right) \right\| \left\| \nabla f_i\left( x_i^{(s)} \right) \right\|}_{c_1} $$

$$ 2ab = 2\gamma C^2 q^k \left\| x_i^{(0)} \right\| \sum_{s=0}^{k} q^{k-s} \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) - \nabla f_i\left( x_i^{(s)} \right) \right\|, \qquad 2ac = 2\gamma C^2 q^k \left\| x_i^{(0)} \right\| \sum_{s=0}^{k} q^{k-s} \left\| \nabla f_i\left( x_i^{(s)} \right) \right\| $$

$$ 2bc = 2\gamma^2 C^2 \sum_{j=0}^{k} \sum_{s=0}^{k} q^{2k-j-s} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\| \left\| \nabla f_i\left( x_i^{(s)} \right) \right\| $$

The cross term $b_1$ is bounded as follows, using $2ab \le a^2 + b^2$ and $\sum_{k=0}^{K} r^k \le \frac{1}{1-r}$ for $r < 1$:

$$ b_1 \le \gamma^2 C^2 \sum_{j=0}^{k} \sum_{s=0}^{k} q^{2k-s-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + \gamma^2 C^2 \sum_{j=0}^{k} \sum_{s=0}^{k} q^{2k-s-j} \left\| \nabla F_i\left( x_i^{(s)}; \xi_i^{(s)} \right) - \nabla f_i\left( x_i^{(s)} \right) \right\|^2 \le \frac{2\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 $$

Thus,

$$ b^2 \le \frac{\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + b_1 \le \frac{3\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 \quad (16) $$

By identical construction we have

$$ c^2 \le \frac{3\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2 $$

Now let us bound the products $2ab$, $2ac$, and $2bc$, again using $2ab \le a^2 + b^2$ and the geometric sum:

$$ 2ab \le \gamma C^2 q^k \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + \frac{\gamma C^2 \left\| x_i^{(0)} \right\|^2}{1-q} q^k \quad (18) $$

$$ 2ac \le \gamma C^2 q^k \sum_{s=0}^{k} q^{k-s} \left\| \nabla f_i\left( x_i^{(s)} \right) \right\|^2 + \frac{\gamma C^2 \left\| x_i^{(0)} \right\|^2}{1-q} q^k \quad (19) $$

$$ 2bc \le \frac{\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 + \frac{\gamma^2 C^2}{1-q} \sum_{s=0}^{k} q^{k-s} \left\| \nabla f_i\left( x_i^{(s)} \right) \right\|^2 \quad (20) $$

By combining all of the above bounds together we obtain:

$$ Q_i^{(k)} \le \mathbb{E}\left[ a^2 + b^2 + c^2 + 2ab + 2bc + 2ac \right] \le \mathbb{E}\left[ \frac{4\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 \right] + \mathbb{E}\left[ \frac{4\gamma^2 C^2}{1-q} \sum_{j=0}^{k} q^{k-j} \left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2 \right] + C^2 \left\| x_i^{(0)} \right\|^2 q^{2k} + \frac{2\gamma C^2 \left\| x_i^{(0)} \right\|^2}{1-q} q^k + \mathbb{E}\left[ \gamma C^2 q^k \sum_{j=0}^{k} q^{k-j} \left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2 \right] + \mathbb{E}\left[ \gamma C^2 q^k \sum_{j=0}^{k} q^{k-j} \left\| \nabla F_i\left( x_i^{(j)}; \xi_i^{(j)} \right) - \nabla f_i\left( x_i^{(j)} \right) \right\|^2 \right] $$

After grouping terms together (bounding the noise terms by $\sigma^2$ and the geometric sums by $\frac{1}{1-q}$), we obtain

$$ Q_i^{(k)} \le \left( \gamma^2 \frac{4C^2}{(1-q)^2} + \gamma q^k \frac{C^2}{1-q} \right) \sigma^2 + \left( q^{2k} C^2 + \gamma q^k \frac{2C^2}{1-q} \right) \left\| x_i^{(0)} \right\|^2 + \left( \gamma^2 \frac{4C^2}{1-q} + \gamma q^k C^2 \right) \sum_{j=0}^{k} q^{k-j} \mathbb{E}\left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2 $$

and then, using the upper bound of Lemma 4 on $\mathbb{E}\left\| \nabla f_i\left( x_i^{(j)} \right) \right\|^2$,

$$ Q_i^{(k)} \le \left( \gamma^2 \frac{4C^2}{(1-q)^2} + \gamma q^k \frac{C^2}{1-q} \right) \sigma^2 + \left( q^{2k} C^2 + \gamma q^k \frac{2C^2}{1-q} \right) \left\| x_i^{(0)} \right\|^2 + \left( \gamma^2 \frac{12C^2}{(1-q)^2} + \gamma q^k \frac{3C^2}{1-q} \right) \zeta^2 + \left( \gamma^2 \frac{12L^2C^2}{1-q} + \gamma q^k 3L^2C^2 \right) \sum_{j=0}^{k} q^{k-j} Q_i^{(j)} + \left( \gamma^2 \frac{12C^2}{1-q} + \gamma q^k 3C^2 \right) \sum_{j=0}^{k} q^{k-j} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(j)} x_l^{(j)} \right) \right\|^2 \quad (22) $$

Having found a bound for the quantity $Q_i^{(k)}$, let us now present a lemma bounding the quantity $\sum_{k=0}^{K-1} M^{(k)}$, where $K > 1$ is a constant and $M^{(k)} = \frac{1}{n}\sum_{i=1}^{n} Q_i^{(k)}$ is the average of $Q_i^{(k)}$ across all (non-virtual) nodes $i \in [n]$.

Lemma 6: Let Assumptions 1-3 hold and define $D_2 = 1 - \gamma^2 \frac{12L^2C^2}{(1-q)^2} - \gamma \frac{3L^2C^2}{(1-q)^2}$.
Then,

$$ \sum_{k=0}^{K-1} M^{(k)} \le \frac{\gamma^2 4C^2}{(1-q)^2 D_2} \sigma^2 K + \frac{\gamma C^2}{(1-q)^2 D_2} \sigma^2 + \frac{\gamma^2 12C^2}{(1-q)^2 D_2} \zeta^2 K + \frac{\gamma 3C^2}{(1-q)^2 D_2} \zeta^2 + \left( \frac{C^2}{(1-q)^2 D_2} + \frac{\gamma 2C^2}{(1-q)^2 D_2} \right) \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} + \left( \frac{\gamma^2 12C^2}{(1-q)^2 D_2} + \frac{\gamma 3C^2}{(1-q)^2 D_2} \right) \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \quad (23) $$

Proof: Using the bound for $Q_i^{(k)}$, let us first bound its average across all nodes:

$$ M^{(k)} = \frac{1}{n}\sum_{i=1}^{n} Q_i^{(k)} \overset{\text{Lemma 5}}{\le} \left( \gamma^2 \frac{4C^2}{(1-q)^2} + \gamma q^k \frac{C^2}{1-q} \right) \sigma^2 + \left( \gamma^2 \frac{12C^2}{(1-q)^2} + \gamma q^k \frac{3C^2}{1-q} \right) \zeta^2 + \left( \gamma^2 \frac{12C^2}{1-q} + \gamma q^k 3C^2 \right) \sum_{j=0}^{k} q^{k-j} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(j)} x_l^{(j)} \right) \right\|^2 + \left( \gamma^2 \frac{12L^2C^2}{1-q} + \gamma q^k 3L^2C^2 \right) \sum_{j=0}^{k} q^{k-j} M^{(j)} + \left( q^{2k} C^2 + \gamma q^k \frac{2C^2}{1-q} \right) \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} $$

At this point note that for any $\lambda \in (0,1)$, non-negative integer $K \in \mathbb{N}$, and non-negative sequence $\left\{ \beta^{(j)} \right\}_{j=0}^{k}$, it holds that

$$ \sum_{k=0}^{K} \sum_{j=0}^{k} \lambda^{k-j} \beta^{(j)} = \beta^{(0)} \left( \lambda^K + \lambda^{K-1} + \cdots + \lambda^0 \right) + \beta^{(1)} \left( \lambda^{K-1} + \lambda^{K-2} + \cdots + \lambda^0 \right) + \cdots + \beta^{(K)} \lambda^0 \le \frac{1}{1-\lambda} \sum_{j=0}^{K} \beta^{(j)} \quad (25) $$

Similarly,

$$ \sum_{k=0}^{K} \lambda^k \sum_{j=0}^{k} \lambda^{k-j} \beta^{(j)} = \sum_{k=0}^{K} \sum_{j=0}^{k} \lambda^{2k-j} \beta^{(j)} \le \sum_{k=0}^{K} \sum_{j=0}^{k} \lambda^{2(k-j)} \beta^{(j)} \le \frac{1}{1-\lambda^2} \sum_{j=0}^{K} \beta^{(j)} \quad (26) $$

Now, by summing from $k = 0$ to $k = K-1$ and using the bounds of (25) and (26), we obtain

$$ \sum_{k=0}^{K-1} M^{(k)} \le \frac{\gamma^2 4C^2}{(1-q)^2} \sigma^2 K + \frac{\gamma C^2}{(1-q)^2} \sigma^2 + \frac{\gamma^2 12C^2}{(1-q)^2} \zeta^2 K + \frac{\gamma 3C^2}{1-q} \zeta^2 + \left( \frac{C^2}{1-q^2} + \frac{\gamma 2C^2}{(1-q)^2} \right) \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} + \left( \frac{\gamma^2 12C^2}{(1-q)^2} + \frac{\gamma 3C^2}{1-q^2} \right) \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + \left( \frac{\gamma^2 12L^2C^2}{(1-q)^2} + \frac{\gamma 3L^2C^2}{1-q^2} \right) \sum_{k=0}^{K-1} M^{(k)} $$

By rearranging:

$$ \left( 1 - \frac{\gamma^2 12L^2C^2}{(1-q)^2} - \frac{\gamma 3L^2C^2}{1-q^2} \right) \sum_{k=0}^{K-1} M^{(k)} \le \frac{\gamma^2 4C^2}{(1-q)^2} \sigma^2 K + \frac{\gamma C^2}{(1-q)^2} \sigma^2 + \frac{\gamma^2 12C^2}{(1-q)^2} \zeta^2 K + \frac{\gamma 3C^2}{(1-q)^2} \zeta^2 + \left( \frac{C^2}{1-q^2} + \frac{\gamma 2C^2}{(1-q)^2} \right) \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} + \left( \frac{\gamma^2 12C^2}{(1-q)^2} + \frac{\gamma 3C^2}{1-q^2} \right) \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 $$

Note that since $q \in (0,1)$ it holds that $\frac{1}{1-q^2} \le \frac{1}{(1-q)^2}$. Thus,

$$ \left( 1 - \frac{\gamma^2 12L^2C^2}{(1-q)^2} - \frac{\gamma 3L^2C^2}{(1-q)^2} \right) \sum_{k=0}^{K-1} M^{(k)} \le \frac{\gamma^2 4C^2}{(1-q)^2} \sigma^2 K + \frac{\gamma C^2}{(1-q)^2} \sigma^2 + \frac{\gamma^2 12C^2}{(1-q)^2} \zeta^2 K + \frac{\gamma 3C^2}{(1-q)^2} \zeta^2 + \left( \frac{C^2}{(1-q)^2} + \frac{\gamma 2C^2}{(1-q)^2} \right) \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} + \left( \frac{\gamma^2 12C^2}{(1-q)^2} + \frac{\gamma 3C^2}{(1-q)^2} \right) \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 \quad (29) $$

Dividing both sides by $D_2 = 1 - \gamma^2 \frac{12L^2C^2}{(1-q)^2} - \gamma \frac{3L^2C^2}{(1-q)^2}$ completes the proof.

Lemma 7: Under the definition of our problem and Assumptions 1-3 we have that:

(i) $\quad \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i=1}^{n} w_i^{(k)} \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right)}{n} \right\|^2 = \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i=1}^{n} w_i^{(k)} \left( \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right)}{n} \right\|^2 + \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i=1}^{n} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n} \right\|^2$

(ii) $\quad \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i=1}^{n} w_i^{(k)} \left( \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right)}{n} \right\|^2 \le \sigma^2$

Proof of (i): Expanding the square around the mean,

$$ \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i} w_i^{(k)} \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right)}{n} \right\|^2 = \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i} w_i^{(k)} \left( \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right)}{n} \right\|^2 + \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n} \right\|^2 + 2 \left\langle \frac{\sum_{i} w_i^{(k)} \mathbb{E}_{\xi_i^{(k)}} \left( \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right)}{n}, \frac{\sum_{i} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n} \right\rangle $$

where the inner product is zero from the fact that $\mathbb{E}_{\xi_i^{(k)}} \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) = \nabla f_i\left( x_i^{(k)} \right)$.

Proof of (ii): By independence of the $\xi_i^{(k)}$, the cross terms between nodes $i \ne j$ vanish for the same reason, so

$$ \mathbb{E}_{\xi^{(k)}} \left\| \frac{\sum_{i} w_i^{(k)} \left( \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right)}{n} \right\|^2 = \frac{1}{n^2} \sum_{i=1}^{n} \left( w_i^{(k)} \right)^2 \mathbb{E}_{\xi_i^{(k)}} \left\| \nabla F_i\left( x_i^{(k)}; \xi_i^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right\|^2 \le \frac{\sigma^2}{n^2} \sum_{i=1}^{n} \left( w_i^{(k)} \right)^2 \le \frac{\sigma^2}{n^2} n^2 = \sigma^2 $$

Lemma 8: Let Assumptions 1-3 hold and let $w^{(k)} = P^{(k)\top} \mathbf{1}_n$, where each element of $w^{(k)}$ is the column sum of $P^{(k)}$.
Since $P$ is row-stochastic, that is, every row sum equals 1, the sum of all elements of $P$ with $n$ rows equals $n$; therefore $\sum_{i=1}^{n} w_i^{(k)} = n$. For a column-stochastic $P$, $w^{(k)} = \mathbf{1}_n$ as in the formulations of Assran et al. (2019) and Koloskova et al. (2020), but we continue our proof without assuming $w^{(k)} = \mathbf{1}_n$ and show convergence. Define

$$ D_1 = \frac{1}{2} - \frac{L_1^2}{2} \cdot \frac{12\gamma^2 C^2 + 3\gamma C^2}{(1-q)^2 D_2}, \qquad D_2 = 1 - \gamma^2 \frac{12L^2C^2}{(1-q)^2} - \gamma \frac{3L^2C^2}{(1-q)^2} $$

Proof: Since $X^{(k+1)} = \left( X^{(k)} - \gamma \nabla F\left( X^{(k)}, \xi^{(k)} \right) \right) P^{(k)\top}$ and $P^{(k)\top} \mathbf{1}_n = w^{(k)}$,

$$ f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) = f\left( \frac{X^{(k)} P^{(k)\top} \mathbf{1}_n - \gamma \nabla F\left( X^{(k)}, \xi^{(k)} \right) P^{(k)\top} \mathbf{1}_n}{n} \right) = f\left( \frac{X^{(k)} w^{(k)}}{n} - \gamma \frac{\nabla F\left( X^{(k)}, \xi^{(k)} \right) w^{(k)}}{n} \right) $$

Here we use the $L$-smoothness inequality $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\left\| y - x \right\|_2^2$:

$$ f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) \le f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \gamma \left\langle \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right), \frac{\nabla F\left( X^{(k)}, \xi^{(k)} \right) w^{(k)}}{n} \right\rangle + \frac{L\gamma^2}{2} \left\| \frac{\nabla F\left( X^{(k)}, \xi^{(k)} \right) w^{(k)}}{n} \right\|^2 $$

Taking expectation on both sides conditioned on $\mathcal{F}_k$ and applying Lemma 7(i) and 7(ii):

$$ \mathbb{E}\left[ f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) \mid \mathcal{F}_k \right] \le f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \gamma \left\langle \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right), \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\rangle + \frac{L\gamma^2 \sigma^2}{2} + \frac{L\gamma^2}{2} \mathbb{E}\left[ \left\| \frac{\sum_{i=1}^{n} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n} \right\|^2 \mid \mathcal{F}_k \right] $$

Expanding the inner product using $\left\| a - b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\langle a, b \rangle$:

$$ = f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\gamma}{2} \left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 - \frac{\gamma}{2} \left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 + \frac{\gamma}{2} \left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 + \frac{L\gamma^2 \sigma^2}{2} + \frac{L\gamma^2}{2} \mathbb{E}\left[ \left\| \frac{\sum_{i} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n} \right\|^2 \mid \mathcal{F}_k \right] $$

Now taking expectation with respect to $\mathcal{F}_k$ and using the tower property, and noting that $\frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} = \frac{\sum_{i} w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right)}{n}$, the last two quadratic terms combine:

$$ \mathbb{E} f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) \le \mathbb{E} f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\gamma}{2} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 - \frac{\gamma - L\gamma^2}{2} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 + \frac{\gamma}{2} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 + \frac{L\gamma^2 \sigma^2}{2} $$

We now bound the deviation term. Since $\nabla f = \frac{1}{n}\sum_{i} \nabla f_i$,

$$ \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 = \mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^{n} \left( \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) - w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right) \right) \right\|^2 \overset{\text{Jensen}}{\le} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\left\| \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) - w_i^{(k)} \nabla f_i\left( x_i^{(k)} \right) \right\|^2 \overset{\text{Cauchy-Schwarz}}{\le} \frac{1}{n}\sum_{i=1}^{n} \left( 1 + \max_i w_i^{(k)} \right)^2 \mathbb{E}\left\| \nabla f_i\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) - \nabla f_i\left( x_i^{(k)} \right) \right\|^2 $$

with $\max_i w_i^{(k)} \le \frac{n}{2}$; by $L$-smoothness of each $\nabla f_i$ this is controlled by $\frac{L^2}{n} \sum_{i=1}^{n} Q_i^{(k)}$ up to the constant above, yielding

$$ \mathbb{E} f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) \le \mathbb{E} f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \frac{\gamma}{2} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 - \frac{\gamma - L\gamma^2}{2} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 + \frac{\gamma L^2}{2n} \sum_{i=1}^{n} Q_i^{(k)} + \frac{L\gamma^2 \sigma^2}{2} \quad (36) $$

By rearranging,

$$ \frac{\gamma}{2} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 + \frac{\gamma - L\gamma^2}{2} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 \le \mathbb{E} f\left( \frac{X^{(k)} w^{(k)}}{n} \right) - \mathbb{E} f\left( \frac{X^{(k+1)} \mathbf{1}_n}{n} \right) + \frac{L\gamma^2 \sigma^2}{2} + \frac{\gamma L^2}{2n} \sum_{i=1}^{n} Q_i^{(k)} $$

Let us now sum from $k = 0$ to $k = K-1$; the function values telescope, and the last inequality below holds because $f^*$ is the theoretical global infimum of our problem:

$$ \frac{\gamma}{2} \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 + \frac{\gamma - L\gamma^2}{2} \sum_{k=0}^{K-1} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 \le f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^* + \frac{LK\gamma^2 \sigma^2}{2} + \frac{\gamma L^2}{2} \sum_{k=0}^{K-1} \underbrace{\frac{1}{n}\sum_{i=1}^{n} Q_i^{(k)}}_{M^{(k)}} $$

Using the bound for the expression $\sum_{k=0}^{K-1} M^{(k)}$ from Lemma 6, we obtain:

$$ \frac{\gamma}{2} \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{X^{(k)} w^{(k)}}{n} \right) \right\|^2 + \frac{\gamma - L\gamma^2}{2} \sum_{k=0}^{K-1} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 \le f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^* + \frac{LK\gamma^2 \sigma^2}{2} + \frac{\gamma L^2}{2} \cdot \frac{4\gamma^2 C^2 \sigma^2 K + \gamma C^2 \sigma^2}{(1-q)^2 D_2} + \frac{\gamma L^2}{2} \cdot \frac{12\gamma^2 C^2 \zeta^2 K + 3\gamma C^2 \zeta^2}{(1-q)^2 D_2} + \frac{\gamma L^2}{2} \cdot \frac{12\gamma^2 C^2 + 3\gamma C^2}{(1-q)^2 D_2} \sum_{k=0}^{K} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + \frac{\gamma L^2}{2} \cdot \frac{C^2 + 2\gamma C^2}{(1-q)^2 D_2} \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} $$

By using $L < L_1$, rearranging, and dividing all terms by $\gamma K$, we obtain:

$$ \frac{1}{K} \left[ \left( \frac{1}{2} - \frac{L_1^2}{2} \cdot \frac{12\gamma^2 C^2 + 3\gamma C^2}{(1-q)^2 D_2} \right) \sum_{k=0}^{K-1} \mathbb{E}\left\| \nabla f\left( \frac{1}{n}\sum_{l} w_l^{(k)} x_l^{(k)} \right) \right\|^2 + \frac{1 - L_1\gamma}{2} \sum_{k=0}^{K-1} \mathbb{E}\left\| \frac{\nabla F\left( X^{(k)} \right) w^{(k)}}{n} \right\|^2 \right] \le \frac{f\left( \frac{1}{n}\sum_{l} w_l^{(0)} x_l^{(0)} \right) - f^*}{\gamma K} + \frac{L_1 \gamma \sigma^2}{2} + \frac{4L_1^2 \gamma^2 C^2 \sigma^2 + 12L_1^2 \gamma^2 C^2 \zeta^2}{2(1-q)^2 D_2} + \frac{\gamma L_1^2 C^2 \sigma^2 + 3L_1^2 \gamma C^2 \zeta^2}{2K(1-q)^2 D_2} + \frac{L_1^2 C^2 + 2L_1^2 \gamma C^2}{2(1-q)^2 D_2 K} \frac{\sum_{i=1}^{n} \left\| x_i^{(0)} \right\|^2}{n} \quad (40) $$

By defining $D_1 = \frac{1}{2} - \frac{L_1^2}{2} \cdot \frac{12\gamma^2 C^2 + 3\gamma C^2}{(1-q)^2 D_2}$, the proof is complete.
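The two structural facts the modified analysis rests on — the product of row-stochastic matrices is row-stochastic, and column sums bounded by $n/2$ give $\|wP\|_2 \le n^{3/2}/2$ — can be sanity-checked numerically. The sketch below is illustrative and not from the paper: `prism_like_mixing` is a hypothetical stand-in that merely satisfies the structural property assumed in the proof (each node keeps weight $\frac{1}{2}$ for itself, so no entry exceeds $\frac{1}{2}$).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def prism_like_mixing(n, rng):
    """Hypothetical row-stochastic mixing matrix in which every node keeps
    weight 1/2 for itself and splits the remaining 1/2 over the other nodes,
    so no entry exceeds 1/2 (the structural property assumed in the proof)."""
    S = rng.random((n, n))
    np.fill_diagonal(S, 0.0)
    S /= S.sum(axis=1, keepdims=True)
    return 0.5 * np.eye(n) + 0.5 * S

P, Q = prism_like_mixing(n, rng), prism_like_mixing(n, rng)

# The product of two row-stochastic matrices is row-stochastic.
assert np.allclose((Q @ P).sum(axis=1), 1.0)

# Premultiplying by 1^T yields the column-sum vector w, with sum(w) = n, ...
w = np.ones(n) @ Q
assert np.isclose(w.sum(), n)
# ... and each column sum is at most n/2, since no entry exceeds 1/2.
assert w.max() <= n / 2 + 1e-12

# Hence ||w P||_2^2 <= n (n/2)^2, i.e. ||w P||_2 <= n^{3/2} / 2.
assert np.linalg.norm(w @ P) <= n ** 1.5 / 2 + 1e-9
```

The same checks pass for any row-stochastic matrices with entries capped at $\frac{1}{2}$, which is the only property of P2PRISM's mixing matrices the bound actually uses.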



Figure 1: Left: Training a k-regular peer-to-peer graph with varying values of k, with same and with different model initialization. We see that it is necessary to initialize nodes with the same model to expect benefit from collaboration (k > 1). Right: Demonstrating the spread of the attack in a k-regular graph where only 2 out of 128 nodes (shown in red) are malicious. We can see that the larger the collaboration degree k, the greater the spread of the attack if the aggregation used is insecure, such as gossip averaging in this case.

Figure 2: The figures show candlestick plots of the final test accuracy (perplexity for Shakespeare) of the benign nodes, comparing Trimmed Mean and P2PRISM under malicious conditions — c/m = 1/8 in the left and c/m = 1/4 in the right figure — along with the baseline benign results for Gossip Averaging in the left figure. For every dataset, the last bar corresponds to a power-law graph while the rest denote k-regular graphs.

Figure 3: The figures show how the consensus distance among the benign nodes changes during training. We show the results on one image dataset (MNIST, left) and on one NLP dataset (Shakespeare, right). The graph topology chosen was power-law to mimic a realistic setting, with k = 8 for c = 16 and k = 16 for c = 32. It is very clear how P2PRISM keeps the consensus distance low compared to Trimmed Mean in a malicious environment.

of nodes have only a few connections. It is difficult to simulate a symmetric graph where both the in-degree and out-degree strictly follow the power law, hence an asymmetric graph was constructed with only the in-degrees of nodes adhering strictly to the power law in order to test the effectiveness of P2PRISM aggregation. To simulate this, the minimum in-degree was chosen as the user input $k$, while the maximum in-degree $\deg_0$ equals $m/2$, where $m$ is the total number of nodes in the network. One random node is simulated with in-degree $\deg_0$, two with $\deg_1$, four with $\deg_2$, and so on, where $k = \deg_{\log_2 m} < \deg_{\log_2 m - 1} < \ldots < \deg_0 = m/2$ form a geometric progression.
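A minimal sketch of the described construction follows; the rounding and neighbor-sampling choices are ours, as the text does not specify them. One node receives in-degree $\deg_0 = m/2$, two receive $\deg_1$, four receive $\deg_2$, and so on down a geometric progression ending at the user input $k$.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def powerlaw_in_degrees(m, k):
    """In-degree sequence per the text: one node with deg_0 = m/2, two with
    deg_1, four with deg_2, ..., deg_0 > deg_1 > ... > deg_{log2(m)} = k in a
    geometric progression. (Hypothetical reconstruction; rounding is ours.)"""
    levels = int(math.log2(m))
    ratio = (k / (m / 2)) ** (1.0 / levels)
    degs = []
    for i in range(levels + 1):
        degs += [max(k, round((m / 2) * ratio ** i))] * (2 ** i)
    return degs[:m]  # the level sizes sum to 2m - 1 candidates; keep m nodes

def powerlaw_in_graph(m, k, rng):
    """Directed adjacency matrix A with A[j, i] = 1 meaning an edge j -> i;
    node i's in-degree follows the sequence, in-neighbors drawn uniformly."""
    degs = powerlaw_in_degrees(m, k)
    A = np.zeros((m, m), dtype=int)
    for i, d in enumerate(degs):
        others = [j for j in range(m) if j != i]
        for j in rng.choice(others, size=d, replace=False):
            A[j, i] = 1
    return A, degs

m, k = 128, 4
A, degs = powerlaw_in_graph(m, k, rng)
assert degs[0] == m // 2             # one hub with in-degree m/2
assert min(degs) >= k                # minimum in-degree is the user input k
assert list(A.sum(axis=0)) == degs   # realized in-degrees match the sequence
```

Only the in-degrees are constrained here; out-degrees fall where they may, matching the asymmetric construction described above.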



Proof of Theorem 1: Let $\gamma \le \min\left\{ \frac{(1-q)^2}{60 L^2 C^2}, 1 \right\}$. Then, by substituting these bounds in Lemma 8, we obtain the stated rate. Let us now substitute $\gamma = \sqrt{n/K}$ in the resulting expression. This can be done due to the lower bound (6) on the total number of iterations $K$, which guarantees that $\sqrt{n/K} \le \min\left\{ \frac{(1-q)^2}{60 L^2 C^2}, 1 \right\}$.

Proof of Theorem 2: From Lemma 6 we have the bound on $\sum_{k=0}^{K-1} M^{(k)}$. Using the assumptions of Theorem 1 and the step size $\gamma = \sqrt{n/K}$, and then using this upper bound in 46 together with the result of Theorem 1, we obtain the claimed rates.
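The interplay between the step-size cap and the iteration lower bound can be checked arithmetically: with $\gamma = \sqrt{n/K}$, the condition $\gamma \le \frac{(1-q)^2}{60 L_1^2 C^2}$ is equivalent to $K \ge \frac{n L_1^4 C^4 (60)^2}{(1-q)^4}$, the first term of the bound (6) on $K$. The constants below are illustrative placeholders, not values from the paper's experiments.

```python
import math

# Illustrative constants (not from the paper): smoothness bound L1,
# mixing constants C and q from Lemma 3, and n nodes.
L1, C, q, n = 2.0, 5.0, 0.9, 128

# Step-size cap used in the proof of Theorem 1 (non-trivial branch).
gamma_cap = min((1 - q) ** 2 / (60 * L1 ** 2 * C ** 2), 1.0)

# gamma = sqrt(n/K) satisfies gamma <= gamma_cap iff K >= n / gamma_cap^2,
# which is exactly the n L1^4 C^4 (60)^2 / (1-q)^4 term of the bound (6).
K_min = n * L1 ** 4 * C ** 4 * 60 ** 2 / (1 - q) ** 4
assert math.isclose(n / gamma_cap ** 2, K_min)
assert math.sqrt(n / math.ceil(K_min)) <= gamma_cap * (1 + 1e-9)
```

Any $K$ at or above the bound (6) therefore makes the substitution $\gamma = \sqrt{n/K}$ admissible.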

