CENTRAL SERVER FREE FEDERATED LEARNING OVER SINGLE-SIDED TRUST SOCIAL NETWORKS

Abstract

Federated learning has become increasingly important for modern machine learning, especially in privacy-sensitive scenarios. Most existing federated learning methods adopt a centralized architecture built around a central server. However, in many social network scenarios, centralized federated learning is not applicable (e.g., a central agent or server connecting all users may not exist, or the communication cost to the central server is not affordable). In this paper, we consider a generic setting: 1) the central server may not exist, and 2) the social network is unidirectional or of single-sided trust (i.e., user A trusts user B but user B may not trust user A). We propose a central server free federated learning algorithm, named the Online Push-Sum (OPS) method, to handle this challenging but generic scenario. A rigorous regret analysis is also provided, which shows how users can benefit from communication with trusted users in the federated learning scenario. This work establishes a fundamental algorithmic framework and theoretical guarantees for federated learning in the generic social network scenario.

1. INTRODUCTION

Federated learning has been well recognized as a framework able to protect data privacy Konečný et al. (2016); Smith et al. (2017a); Yang et al. (2019). State-of-the-art federated learning adopts a centralized network architecture in which a central node collects the gradients sent from child agents to update the global model. Despite its simplicity, the centralized method suffers from communication and computational bottlenecks at the central node, especially in federated learning, where a large number of clients are usually involved. Moreover, to prevent reverse engineering of the user's identity, a certain amount of noise must be added to the gradient to protect user privacy, which partially sacrifices efficiency and accuracy Shokri and Shmatikov (2015).

To further protect data privacy and avoid the communication bottleneck, the decentralized architecture has recently been proposed Vanhaesebrouck et al. (2017); Bellet et al. (2018), in which the central node is removed and each node only communicates with its neighbors (with mutual trust) by exchanging local models. Exchanging local models is usually preferred over sending private gradients for privacy protection, because the local model is an aggregation or mixture of a large amount of data, while the local gradient directly reflects only one or a batch of private data samples. Although the advantages of the decentralized architecture over its centralized counterpart have been well recognized, it can usually only be run on a network with mutual trust. That is, two nodes (or users) can exchange their local models only if they trust each other reciprocally (e.g., node A may trust node B, but if node B does not trust node A, they cannot communicate). Given a social network, one can only use the edges with mutual trust to run decentralized federated learning algorithms.
Two immediate drawbacks follow: (1) if the mutual trust edges do not form a connected network, federated learning does not apply; (2) removing all single-sided edges from the communication network could significantly reduce the efficiency of communication. These drawbacks lead to the question: how do we effectively utilize the single-sided trust edges under the decentralized federated learning framework?

In this paper, we consider the social network scenario, where the centralized network is unavailable (e.g., there does not exist a central node that can build up the connection with all users, or the centralized communication cost is not affordable). We make minimal assumptions on the social network: the data may come in a streaming fashion on each user node as the federated learning algorithm runs; the trust between users may be single-sided, where user A trusts user B, but user B may not trust user A ("trust" means "would like to send information to").

For the setting mentioned above, we develop a decentralized learning algorithm called online push-sum (OPS), which possesses the following features:

• Only models rather than local gradients are exchanged among clients in our algorithm. This scheme can reduce the risk of exposing clients' data privacy Aono et al. (2017).
• Our algorithm removes some constraints imposed by typical decentralized methods, which makes it more flexible in allowing arbitrary network topology. Each node only needs to know its out neighbors instead of the global topology.
• We provide a rigorous regret analysis for the proposed algorithm and specifically distinguish two components in the online loss function: the adversarial component and the stochastic component, which can model clients' private data and internal connections between clients, respectively.
Notation. We adopt the following notation in this paper:

• For a random variable $\xi^{(i)}_t$ subject to distribution $D^{(i)}_t$, we use $\Xi_{n,T}$ and $D_{n,T}$ to denote the sets of random variables and distributions, respectively: $\Xi_{n,T} = \{\xi^{(i)}_t\}_{1\le i\le n,\, 1\le t\le T}$, $D_{n,T} = \{D^{(i)}_t\}_{1\le i\le n,\, 1\le t\le T}$. The notation $\Xi_{n,T} \sim D_{n,T}$ means $\xi^{(i)}_t \sim D^{(i)}_t$ for any $i \in [n]$ and $t \in [T]$.
• For a decentralized network with $n$ nodes, we use $W \in \mathbb{R}^{n\times n}$ to denote the confusion matrix, where $W_{ij} \ge 0$ is the weight that node $i$ sends to node $j$ ($i, j \in [n]$). $N^{out}_i = \{j \in [n] : W_{ij} > 0\}$ and $N^{in}_i = \{k \in [n] : W_{ki} > 0\}$ denote the sets of out neighbors and in neighbors of node $i$, respectively.
• The norm $\|\cdot\|$ denotes the $\ell_2$ norm $\|\cdot\|_2$ by default.

2. RELATED WORK

The concept of federated learning was first proposed in McMahan et al. (2016), which advocates a novel learning setting that learns a shared model by aggregating locally computed gradient updates without centralizing the data distributed across devices. Early examples of research into federated learning also include Konečný et al. (2015; 2016) and a widely read blog article posted by Google AI, McMahan and Ramage (2017). To address both statistical and system challenges, Smith et al. (2017b) and Caldas et al. (2018) propose a multi-task learning framework for federated learning and its related optimization algorithm, which extends the earlier works SDCA Shalev-Shwartz and Zhang (2013); Yang (2013); Yang et al. (2013) and COCOA Jaggi et al. (2014); Ma et al. (2015); Smith et al. (2016) to the federated learning setting. Among these optimization methods, Federated Averaging (FedAvg), proposed by McMahan et al. (2016), requires fewer communication rounds than conventional synchronized mini-batch SGD and also converges on non-IID and unbalanced data. Recent rigorous theoretical analyses Stich (2018); Wang and Joshi (2018); Yu et al. (2018); Lin et al. (2018) show that FedAvg is a special case of periodic averaging SGD (also called "local SGD"), which allows nodes to perform local updates and synchronize infrequently, so that they communicate less while still converging quickly. However, these methods cannot be applied to the single-sided trust network (asymmetric topology matrix).

Decentralized learning is a typical parallel strategy where each worker is only required to communicate with its neighbors, which means the communication bottleneck (in the parameter server) is removed. It has been proved that decentralized learning can outperform traditional centralized learning when the number of workers is comparably large and the network condition is poor Lian et al. (2017). There are two main types of decentralized learning algorithms: those with a fixed network topology He et al. (2018), and those whose topology varies over time during training Nedić and Olshevsky (2015); Lian et al. (2018). Wu et al. (2017); Shen et al. (2018) show that decentralized SGD converges at a rate comparable to that of the centralized algorithm with less communication, which makes large-scale model training feasible. Li et al. (2018) provide a systematic analysis of the decentralized learning pipeline.

Online learning has been studied for decades. It is well known that the regret lower bounds of online optimization methods are $O(\sqrt{T})$ and $O(\log T)$ for convex and strongly convex loss functions, respectively Hazan et al. (2016); Shalev-Shwartz et al. (2012). In recent years, due to the increasing volume of data, distributed online learning, especially decentralized methods, has attracted much attention. Examples of these works include Kamp et al. (2014); Shahrampour and Jadbabaie (2017); Lee et al. (2016). Notably, Zhao et al. (2019) share a similar problem definition and theoretical result with our paper. However, single-sided communication is not allowed in their setting, which restricts their results.

3. PROBLEM SETTING

In this paper, we consider federated learning with n clients (a.k.a. nodes). Each client can be either an edge server or some other kind of computing device such as a smartphone, which has local private data and a local machine learning model x i stored on it. We assume the topological structure of the network of these n nodes can be represented by a directed graph G = (nodes : 

Let $x^{(i)}_t$ denote the local model on the i-th node at iteration t. In each iteration, node i receives a new sample and computes a prediction for this new sample according to the current model $x^{(i)}_t$ (e.g., it may recommend some items to the user in an online recommendation system). After that, a loss function $f_{i,t}(\cdot)$ associated with that new sample is received by node i. The typical goal of online learning is to minimize the regret, which is defined as the difference between the summation of the losses incurred by the nodes' predictions and the corresponding loss of the global optimal model $x^*$:

$$R_T := \sum_{t=1}^{T}\sum_{i=1}^{n}\left( f_{i,t}(x^{(i)}_t) - f_{i,t}(x^*) \right),$$

where $x^* = \arg\min_x \sum_{t=1}^{T}\sum_{i=1}^{n} f_{i,t}(x)$ is the optimal solution.

However, here we consider a more general online setting: the loss function of the i-th node at iteration t is $f_{i,t}(\cdot\,; \xi_{i,t})$, which is additionally parametrized by a random variable $\xi_{i,t}$. This $\xi_{i,t}$ is drawn from the distribution $D_{i,t}$ and is mutually independent across i and t; we call this part the stochastic component of the loss function $f_{i,t}(\cdot\,; \xi_{i,t})$. The stochastic component can be utilized to characterize the internal randomness of nodes' data and the potential connection among different nodes. For example, music preference may be impacted by popular trends on the Internet, which can be captured by our model by letting $D_{i,t} \equiv D_t$ for all $i \in [n]$ with some time-varying distribution $D_t$. On the other hand, the function $f_{i,t}(\cdot\,; \cdot)$ is the adversarial component of the loss, which may include, for example, the user's profile, location, etc. Therefore, the objective regret naturally becomes the expectation of all the past losses:

$$R_T := \mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left[\sum_{t=1}^{T}\sum_{i=1}^{n}\left( f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) - f_{i,t}(x^*; \xi^{(i)}_t) \right)\right] \tag{1}$$

with $x^* = \arg\min_x \mathbb{E}_{\Xi_{n,T}\sim D_{n,T}} \sum_{t=1}^{T}\sum_{i=1}^{n} f_{i,t}(x; \xi^{(i)}_t)$.

One benefit of the above formulation is that it partially resolves the non-I.I.D. issue in federated learning.
A fundamental assumption in many traditional distributed machine learning methods is that the data samples stored on all nodes are I.I.D., which fails to hold for federated learning since the data on each user's device is highly correlated with that user's preferences and habits. In contrast, our formulation does not require the I.I.D. assumption to hold for the adversarial component at all. Even though the random samples for the stochastic component still need to be independent, they are allowed to be drawn from different distributions. Finally, one should note that online optimization also includes stochastic optimization (i.e., data samples are drawn from a fixed distribution) and offline optimization (i.e., data are already collected before optimization begins) as special cases Shalev-Shwartz et al. (2012). Hence, our setting covers a wide range of applications.

4. ONLINE PUSH-SUM ALGORITHM

In this section, we define the construction of the confusion matrix and introduce the proposed algorithm.

4.1. CONSTRUCTION OF CONFUSION MATRIX

One important parameter of the algorithm is the confusion matrix $W$. $W$ depends on the network topology G, which means $W_{ij} = 0$ if there is no directed edge (i, j) in G. The larger the value of $W_{ij}$, the stronger the impact of node i on node j. However, $W$ still allows flexibility: users can specify the weights associated with existing edges, so even if there is a physical connection between two nodes, the nodes can decide against using the channel. For example, even if $(i, j) \in E$, user i can still set $W_{ij} = 0$ if user i thinks node j is not trustworthy and therefore chooses to exclude the channel from i to j.

Of course, there are still some constraints on $W$. $W$ must be a row stochastic matrix (i.e., each entry in $W$ is non-negative, and each row sums to 1). This assumption is different from the one in classical decentralized distributed optimization, which typically assumes $W$ is symmetric and doubly stochastic (i.e., all rows and all columns sum to 1), e.g., Duchi et al. (2011). Such a requirement is quite restrictive, because not all networks admit a doubly stochastic matrix Gharesifard and Cortés (2010), and relinquishing double stochasticity can introduce bias in optimization Ram et al. (2010); Tsianos and Rabbat (2012). In comparison, our assumption that $W$ is row stochastic avoids such concerns, since any non-negative matrix with at least one positive entry in each row (which is already implied by the connectivity of the graph) can be easily normalized into a row stochastic matrix. The relaxation of this assumption is crucial for federated learning, considering that a federated learning system usually involves a complex network topology due to its large number of clients.
Moreover, since each node only needs to make sure that its out-weights sum to 1, there is no need for it to be aware of the global network topology, which significantly simplifies the implementation of the federated learning system. Meanwhile, requiring $W$ to be symmetric rules out asymmetric network topologies and single-sided trust, while our method has no such restriction.
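The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `row_stochastic` and `is_doubly_stochastic` and the 3-node `trust` matrix are hypothetical, and the self-weights on the diagonal are an assumption of this sketch rather than something the text prescribes.

```python
def row_stochastic(trust):
    """Normalize a non-negative weight matrix so that every row sums to 1.

    trust[i][j] > 0 means node i is willing to send its model to node j.
    Each row is normalized using only that row, so a node needs no knowledge
    of the global topology.
    """
    W = []
    for i, row in enumerate(trust):
        if any(w < 0 for w in row):
            raise ValueError(f"node {i} has a negative weight")
        total = sum(row)
        if total <= 0:
            raise ValueError(f"node {i} has no positive out-weight")
        W.append([w / total for w in row])
    return W


def is_doubly_stochastic(W, tol=1e-9):
    """Check whether all rows and all columns sum to 1."""
    n = len(W)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in W)
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return rows_ok and cols_ok


# Hypothetical 3-node network with single-sided trust (diagonal = self-weight):
# node 0 sends to node 1, node 1 sends to node 2, node 2 sends to nodes 0 and 1.
trust = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
W = row_stochastic(trust)
```

In this toy example every row of `W` sums to 1 while the columns do not, which is the situation that rules out methods requiring a doubly stochastic matrix.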

4.2. ALGORITHM DESCRIPTION

The proposed online push-sum algorithm is presented in Algorithm 1. The algorithm design mainly follows the pattern of the push-sum algorithm Tsianos et al. (2012), but here we further generalize it to the online setting. The algorithm mainly consists of three steps:

1. Local update: each client i applies the current local model $x^{(i)}_t$ and suffers the corresponding loss, from which the intermediate variable $z^{(i)}_{t+\frac{1}{2}}$ is computed;
2. Push: the weighted variable $W_{ij} z^{(i)}_{t+\frac{1}{2}}$ is sent to every out neighbor j of node i;
3. Sum: all the received $W_{ji} z^{(j)}_{t+\frac{1}{2}}$ are summed and normalized to obtain the new model $x^{(i)}_{t+1}$.

Algorithm 1 Online Push-Sum (OPS) Algorithm
Require: Learning rate $\gamma$, number of iterations $T$, and the confusion matrix $W$.
Initialize $x^{(i)}_0 = z^{(i)}_0 = 0$, $\omega^{(i)}_0 = 1$ for all $i \in [n]$
for $t = 0, 1, \ldots, T-1$ do
    // For all users (say the i-th node, $i \in [n]$)
    Apply local model $x^{(i)}_t$ and suffer loss $f_{i,t}(x^{(i)}_t; \xi^{(i)}_t)$
    Locally compute the intermediate variable $z^{(i)}_{t+\frac{1}{2}} = z^{(i)}_t - \gamma \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t)$
    Send $\left(W_{ij} z^{(i)}_{t+\frac{1}{2}},\; W_{ij}\,\omega^{(i)}_t\right)$ to all $j \in N^{out}_i$
    Update $z^{(i)}_{t+1} = \sum_{k \in N^{in}_i} W_{ki}\, z^{(k)}_{t+\frac{1}{2}}$, $\;\omega^{(i)}_{t+1} = \sum_{k \in N^{in}_i} W_{ki}\, \omega^{(k)}_t$, $\;x^{(i)}_{t+1} = z^{(i)}_{t+1} / \omega^{(i)}_{t+1}$
end for
return $x^{(i)}_T$ to node i

It should be noted that the auxiliary variables $z^{(i)}_{t+\frac{1}{2}}$ and $z^{(i)}_{t+1}$ are used in the algorithm to clarify the description; they can easily be removed in a practical implementation. Besides, another variable $\omega^{(i)}_{t+1}$ is also introduced, which is the normalizing factor of $z^{(i)}_{t+1}$. $\omega^{(i)}_{t+1}$ plays an important role in the push-sum algorithm: since $W$ is not doubly stochastic in our setting, the total weight that node i receives may not equal 1. The normalizing factor $\omega^{(i)}_t$ helps the algorithm avoid the issues caused by $W$ not being doubly stochastic.
Furthermore, when $W$ becomes doubly stochastic, it can be easily verified that $\omega^{(i)}_t \equiv 1$ and $x^{(i)}_t \equiv z^{(i)}_t$ for any i and t, so that Algorithm 1 reduces to the distributed online gradient method proposed by Zhao et al. (2019). In the algorithm, the local data, which is encoded in the gradient $\nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t)$ Shokri and Shmatikov (2015), is only utilized in updating the local model. What neighboring nodes exchange is limited to the local models.
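To make the three steps concrete, the following is a minimal single-process simulation of the update rule of Algorithm 1, written as a sketch rather than as the authors' implementation. The function names `ops_step` and `run_ops`, the callback signature `grad_fn(i, t, x_i)`, and the toy quadratic losses are all assumptions made for illustration; a real deployment would exchange messages between devices instead of looping over a shared matrix.

```python
def ops_step(W, z, omega, grads, gamma):
    """One OPS round for all nodes; z[i] and grads[i] are lists of length d."""
    n, d = len(z), len(z[0])
    # Local update: z_{t+1/2}^{(i)} = z_t^{(i)} - gamma * gradient on node i.
    z_half = [[z[i][c] - gamma * grads[i][c] for c in range(d)]
              for i in range(n)]
    # Push and sum: node i adds up W[k][i] * z_half[k] over all senders k
    # (W[k][i] is zero when k is not an in neighbor of i).
    z_new = [[sum(W[k][i] * z_half[k][c] for k in range(n)) for c in range(d)]
             for i in range(n)]
    omega_new = [sum(W[k][i] * omega[k] for k in range(n)) for i in range(n)]
    # Normalize by the received weight to obtain the new local model.
    x_new = [[z_new[i][c] / omega_new[i] for c in range(d)] for i in range(n)]
    return z_new, omega_new, x_new


def run_ops(W, grad_fn, dim, T, gamma):
    """Run T rounds; grad_fn(i, t, x_i) returns node i's gradient at x_i."""
    n = len(W)
    z = [[0.0] * dim for _ in range(n)]
    omega = [1.0] * n
    x = [[0.0] * dim for _ in range(n)]
    for t in range(T):
        grads = [grad_fn(i, t, x[i]) for i in range(n)]
        z, omega, x = ops_step(W, z, omega, grads, gamma)
    return x, z, omega


# Toy usage (hypothetical): three nodes, row stochastic but not doubly
# stochastic W, and local losses 0.5 * (x - target_i)^2.
W_demo = [[0.5, 0.5, 0.0],
          [0.0, 0.5, 0.5],
          [1 / 3, 1 / 3, 1 / 3]]
targets = [1.0, 2.0, 3.0]
x_demo, z_demo, omega_demo = run_ops(
    W_demo, lambda i, t, xi: [xi[0] - targets[i]], dim=1, T=200, gamma=0.1)
```

Two properties are easy to check in this sketch: the entries of `omega` always sum to n because each row of W sums to 1, and when W is doubly stochastic `omega` stays at 1 so that the models `x` coincide with `z`.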

4.3. REGRET ANALYSIS

In this subsection, we provide the regret bound analysis of the OPS algorithm. Due to the limitation of space, the detailed proof is deferred to the appendix. For convenience, we first denote $F_{i,t}(x) := \mathbb{E}_{\xi_{i,t}\sim D_{i,t}} f_{i,t}(x; \xi_{i,t})$. To carry out the analysis, the following assumptions are required:

Assumption 1. We make the following assumptions throughout this paper:
(1) The topological graph G is strongly connected, and $W$ is row stochastic;
(2) For any $i \in [n]$ and $t \in [T]$, the loss function $f_{i,t}(x; \xi_{i,t})$ is convex in $x$;
(3) The problem domain is bounded such that for any two vectors $x$ and $y$ we always have $\|x - y\|_2 \le R$;
(4) The norm of the expected gradient $\nabla F_{i,t}(\cdot)$ is bounded, i.e., there exists a constant $G > 0$ such that $\|\nabla F_{i,t}(x)\|^2 \le G^2$ for any i, t and $x$;
(5) The gradient variance is also bounded by $\sigma^2$, namely, $\mathbb{E}_{\xi_{i,t}\sim D_{i,t}} \|\nabla f_{i,t}(x; \xi_{i,t}) - \nabla F_{i,t}(x)\|^2 \le \sigma^2$.

Here the constant $G$ provides an upper bound for the adversarial component. On the other hand, $\sigma$ measures the magnitude of stochasticity brought by the stochastic component. When $\sigma = 0$, the problem setting simply reduces to normal distributed online learning. The strong connectivity assumption is necessary to ensure that information can be exchanged between any two nodes. The convexity and domain boundedness assumptions are quite common in the online learning literature, e.g., Hazan et al. (2016). Equipped with these assumptions, we are now ready to present the convergence result:

Theorem 2. If we set

$$\gamma = \frac{\sqrt{n}\,R}{\left(\sigma\sqrt{1 + nC_2} + G\sqrt{nC_1}\right)\sqrt{T}}, \tag{2}$$

the regret of OPS can be bounded by

$$R_T \le O\left(nGR\sqrt{T} + \sigma R\sqrt{1 + nC_2}\,\sqrt{nT}\right),$$

where $C_1$ and $C_2$ are two constants defined in the appendix.

Note that when $n = 1$ and $\sigma = 0$, where the problem setting reduces to normal online optimization, the implied regret bound $O(GR\sqrt{T})$ exactly matches the lower bound of online optimization Hazan et al. (2016).
Moreover, our result also matches the convergence rate of centralized online learning, where $q = 0$ for fully connected networks. Hence, we can conclude that the OPS algorithm has optimal dependence on $T$. This bound has a linear dependence on the number of nodes $n$, which is easy to understand. First, we have defined the regret to be the summation of the losses on all the nodes, so increasing $n$ naturally makes the regret larger. Second, our federated learning setting differs from typical distributed learning in that the I.I.D. assumption does not hold here. Each node contains distinct local data that may be drawn from totally different distributions. Therefore, adding more nodes does not help decrease the regret of existing clients.

Moreover, we also prove that the difference between the models $x^{(i)}_t$ on the workers can be bounded, as stated in the following theorem:

Theorem 3. If we set $\gamma$ as in (2), the difference of the model $x^{(i)}_t$ on each worker admits a faster convergence rate than the regret:

$$\frac{1}{T}\sum_{i=1}^{n}\sum_{t=0}^{T} \left\|x^{(i)}_{t+1} - \bar{z}_{t+1}\right\|^2 \le O\left(\frac{nGR + nR\sigma}{T}\right),$$

where $\bar{z}_{t+1}$ is the average of $z^{(i)}_{t+1}$ over all nodes. Hence, the models on all clients' devices will finally converge to the same one with rate $O(1/T)$.
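For readers who want to see where the step size comes from, the following worked derivation sketches how the bound of Theorem 2 follows from Theorem 4 in the appendix. The shorthand $a$ and $b$ is introduced here only for illustration and does not appear in the original statements; the constant $3/2$ is what this particular calculation gives and is absorbed by the $O(\cdot)$ notation.

```latex
% Shorthand (introduced for this sketch only):
%   a := \sigma\sqrt{1 + nC_2}, \qquad b := G\sqrt{nC_1}.
% Theorem 4 states, for any step size \gamma > 0,
\begin{align*}
R_T &\le G^2 T n \gamma C_1 + \sigma^2 T \gamma (1 + nC_2) + \frac{nR^2}{2\gamma}
     = (a^2 + b^2)\,\gamma T + \frac{nR^2}{2\gamma}
     \le (a + b)^2\,\gamma T + \frac{nR^2}{2\gamma}. \\
% Plugging in \gamma = \sqrt{n}R / ((a+b)\sqrt{T}):
R_T &\le (a + b)\,R\sqrt{nT} + \tfrac{1}{2}(a + b)\,R\sqrt{nT}
     = \tfrac{3}{2}\left( nGR\sqrt{C_1 T} + \sigma R \sqrt{1 + nC_2}\,\sqrt{nT} \right).
\end{align*}
```

The two terms of the bound scale in opposite directions with $\gamma$, so a step size proportional to $1/\sqrt{T}$ balances them and yields the $\sqrt{T}$ dependence.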

4.4. PRIVACY PROTECTION

Our proposed algorithm has several advantages concerning privacy protection. First, as we have mentioned, OPS runs in a decentralized way and exchanges models instead of gradients or training samples, which has been shown to be effective for reducing the risk of privacy leakage Bellet et al. (2017). Second, OPS runs in a decentralized and asymmetric fashion. These properties create difficulties for many attacking methods such as Nasr et al. (2018). In order to infer the data of other clients, the attacker needs to observe the reactions of other nodes after the attack is injected, which is impossible when the connections are single-sided. Even if the attack spreads through the whole network and finally returns to the attacker, it is still hard for the attacker to tell whether the information received from its neighbors has already been affected by the attack, since the attacker is unaware of the global topology.

5. EXPERIMENTS

We compare the performance of our proposed Online Push-Sum (OPS) method with that of Decentralized Online Gradient method (DOL) and Centralized Online Gradient method (COL), and then evaluate the effectiveness of OPS in different network size and network topology density settings.

5.1. IMPLEMENTATION AND SETTINGS

We consider online logistic regression with squared $\ell_2$ norm regularization:

$$f_{i,t}(x; \xi_{i,t}) = \log\left(1 + \exp\left(-y_{i,t} A_{i,t}^\top x\right)\right) + \frac{\lambda}{2}\|x\|^2,$$

where the regularization coefficient $\lambda$ is set to $10^{-4}$. $\xi_{i,t}$ is the stochastic component of the function $f_{i,t}$ introduced in Section 3, which is encoded in the random data sample $(A_{i,t}, y_{i,t})$. We evaluate the learning performance by measuring the average loss

$$\frac{1}{nT}\,\mathbb{E}_{\Xi_{n,T}}\left[\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x_{i,t}; \xi_{i,t})\right],$$

instead of using the regret (1) directly, since the optimal reference point $x^*$ is the same for all the methods. The learning rate $\gamma$ in Algorithm 1 is tuned to be optimal for each dataset separately. The experiment implementation is based on Python 3.7.0, PyTorch 1.2.0, NetworkX 2.3, and scikit-learn 0.20.3. The source code, along with other information concerning the experiments such as the setting of the hyper-parameters, is provided in the supplementary materials.

Dataset. Experiments were run on two real-world public datasets: SUSY and Room-Occupancy. Both are binary classification datasets, containing 5,000,000 and 20,566 samples, respectively. Each dataset is split into two subsets: the stochastic data and the adversarial data. The stochastic data is generated by allocating a fraction of the samples (e.g., 50% of the whole dataset) to nodes randomly and uniformly. The adversarial data is generated by clustering the remaining dataset into n clusters and then allocating every cluster to a node. As we analyzed previously, only the scattered stochastic data can boost the model performance through inter-node communication. For each node, this pre-acquired data is transformed into streaming data to simulate online learning.
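As an illustration of the per-sample loss above, the following sketch evaluates the regularized logistic loss and its gradient for a single sample. It is not the paper's PyTorch code: the function names are hypothetical, labels are assumed to take values in {-1, +1} (which the formula implies but the text does not state), and the branching is only there to keep the exponentials numerically stable.

```python
import math


def logistic_loss(x, a, y, lam=1e-4):
    """log(1 + exp(-y * <a, x>)) + (lam / 2) * ||x||^2 for one sample (a, y)."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    if margin > 0:
        data_term = math.log1p(math.exp(-margin))
    else:
        # Equivalent form that avoids overflow when the margin is very negative.
        data_term = -margin + math.log1p(math.exp(margin))
    return data_term + 0.5 * lam * sum(xi * xi for xi in x)


def logistic_grad(x, a, y, lam=1e-4):
    """Gradient of logistic_loss with respect to x."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    # s = 1 / (1 + exp(margin)), computed without overflow.
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-y * s * ai + lam * xi for ai, xi in zip(a, x)]
```

A function such as `logistic_grad` is what a node would evaluate on its newly received sample in the local update step of Algorithm 1.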

5.2. COMPARISON WITH DOL AND COL

To compare OPS with DOL and COL, networks with 128 nodes and 20 nodes are selected for SUSY and Room-Occupancy, respectively. For COL, the confusion matrix $W$ is fully connected (a doubly stochastic matrix). DOL and OPS are run with the same network topology and the same row stochastic matrix (asymmetric confusion matrix) to maintain a fair comparison. This asymmetric confusion matrix is constructed by setting each node's number of neighbors to a random value smaller than a fixed upper bound, while also ensuring the strong connectivity of the whole network (the upper bound on the number of neighbors is set to 32 for the SUSY dataset and 10 for the Room-Occupancy dataset). Since DOL typically requires a symmetric and doubly stochastic confusion matrix, DOL is run in two settings for comparison. In the first setting, in order to meet the symmetry and double stochasticity assumptions, all unidirectional connections are removed from the confusion matrix so that the row stochastic confusion matrix degenerates into a doubly stochastic matrix. This setting is labeled DOL-Symm in Figure 2. In the other setting, DOL is forced to run on the asymmetric network, where each node naively aggregates its received models without considering whether its sending weights are equal to its receiving weights. This setting is labeled DOL-Asymm in Figure 2.

As illustrated in Figure 2, on both datasets, OPS outperforms DOL-Symm under the row stochastic confusion matrix. This demonstrates that incorporating unidirectional communication can help boost the model performance. In other words, OPS gains better performance in the single-sided trust network. Although DOL-Asymm utilizes additional unidirectional connections, in some cases its performance is even worse than that of DOL-Symm (e.g., Figure 2a).
This phenomenon is most likely attributable to its simple aggregation pattern, which degrades the performance of DOL-Asymm once the doubly stochastic matrix assumption is removed. These two observations confirm the effectiveness of OPS with a row stochastic confusion matrix, which is consistent with our theoretical analysis. Comparing Figure 2c and Figure 2d, we also observe that the average loss (regret) becomes smaller as the ratio of the stochastic component increases. It is reasonable that OPS achieves slightly worse performance than COL, because OPS works in a sparsely connected network where much less information is exchanged than in COL. We use COL as the baseline in all experiments. Only the number of iterations, rather than the actual running time, is considered in the experiments. Presenting the actual running time would be redundant: because the centralized method requires more time per iteration due to network congestion at the central node, OPS usually outperforms COL in terms of running time.
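One possible way to build an asymmetric, strongly connected topology of the kind used in this comparison is sketched below. This is not the construction from the supplementary materials: the directed ring used to guarantee strong connectivity, the uniform weights including a self-weight, and the function names `random_trust_graph`, `uniform_weights`, and `is_strongly_connected` are all assumptions of this sketch.

```python
import random
from collections import deque


def random_trust_graph(n, max_out, seed=0):
    """Return out[i], the sorted out neighbors of node i (at most max_out).

    Assumption of this sketch: every node sends to (i + 1) % n, so a directed
    ring guarantees strong connectivity; the other edges are random.
    """
    rng = random.Random(seed)
    out = []
    for i in range(n):
        k = rng.randint(1, max_out)  # number of out neighbors of node i
        ring = (i + 1) % n
        others = [j for j in range(n) if j != i and j != ring]
        nbrs = {ring}
        nbrs.update(rng.sample(others, min(k - 1, len(others))))
        out.append(sorted(nbrs))
    return out


def uniform_weights(out):
    """Row stochastic W: each node splits weight evenly over itself and out[i]."""
    n = len(out)
    W = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(out):
        share = 1.0 / (len(nbrs) + 1)
        W[i][i] = share
        for j in nbrs:
            W[i][j] = share
    return W


def _reaches_all(adj, start=0):
    """Breadth-first search: does `start` reach every node along adj?"""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def is_strongly_connected(out):
    """Strongly connected iff node 0 reaches all nodes in the graph and its reverse."""
    rev = [[] for _ in out]
    for i, nbrs in enumerate(out):
        for j in nbrs:
            rev[j].append(i)
    return _reaches_all(out) and _reaches_all(rev)
```

For instance, `uniform_weights(random_trust_graph(20, 10))` gives a 20-node row stochastic matrix in the spirit of the Room-Occupancy setting, in which most edges are single-sided.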

5.3. EVALUATION ON DIFFERENT NETWORK SIZES

Figures 3a and 3b summarize the evaluation of OPS for different network sizes (for the SUSY dataset, sizes 128, 256, 512, and 1024 are used). The upper bound on the number of neighbors is set to the same value across the different network sizes to isolate the impact of the network size. As we can see, the average loss (regret) curves for different network sizes are close to each other, which demonstrates that OPS is robust to the network size. Furthermore, the average loss (regret) is smaller for larger networks (i.e., the curve for $n = 1024$ is lower than the others), which also demonstrates that more stochastic samples provided by more nodes naturally accelerate the convergence. Due to the limitation of space, the results on the other dataset are deferred to the appendix.

5.4. EVALUATION ON NETWORK DENSITY

We also evaluate the performance of OPS under different network densities. We fix the network size to 512 for the SUSY dataset. Network density is defined as the ratio of the upper bound on the random number of neighbors per node to the size of the network (e.g., a ratio of 0.5 in SUSY means that 256 is set as the upper bound on the number of neighbors for each node). We can see from Figures 3c and 3d that the average loss (regret) decreases as the network density increases. This observation indicates that our proposed OPS algorithm works well under different network densities and gains more benefit from a denser row stochastic matrix. This benefit can also be understood intuitively: in a federated learning network, a user's model performance will improve if it communicates with more users. The results on Room-Occupancy are also deferred to the appendix.

A PROOFS

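Notations: Below we use the following notation in our proof:

• $\nabla F_t(X_t) := \left[\nabla F_{1,t}(x^{(1)}_t), \cdots, \nabla F_{n,t}(x^{(n)}_t)\right]$
• $X_t := \left[x^{(1)}_t, x^{(2)}_t, \ldots, x^{(n)}_t\right]$
• $G_t := \left[\nabla f_{1,t}(x^{(1)}_t; \xi^{(1)}_t), \ldots, \nabla f_{n,t}(x^{(n)}_t; \xi^{(n)}_t)\right]$

We first present the proof of Theorem 2, and then present some key lemmas along with the proof of Theorem 3. The following theorem is the key to proving Theorem 2:

Theorem 4. For the online push-sum algorithm with step size $\gamma > 0$, it holds that

$$R_T \le G^2 T n \gamma C_1 + \sigma^2 T \gamma (1 + nC_2) + \frac{nR^2}{2\gamma},$$

where $C_1 := \frac{8Cq}{\delta_{\min}(1-q)} + 1$, $C_2 := \frac{2Cq}{\delta_{\min}(1-q)}$, and $C$, $q$ and $\delta_{\min}$ are constants defined in later lemmas.

Proof. Since the loss function $f_{i,t}(\cdot)$ is assumed to be convex, we have

$$\begin{aligned}
\mathbb{E}_t \sum_{i=1}^{n} f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) - nF_t(x^*)
&= \mathbb{E}_t \sum_{i=1}^{n} \left( f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) - f_{i,t}(x^*; \xi^{(i)}_t) \right) \\
&\le \mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, x^{(i)}_t - x^* \right\rangle \\
&= \underbrace{\mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle}_{=: I_{1t}}
 + \underbrace{\mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, \bar{z}_t - x^* \right\rangle}_{=: I_{2t}}.
\end{aligned}$$

For $I_{2t}$, we have

$$\begin{aligned}
\mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, \bar{z}_t - x^* \right\rangle
&= \frac{n}{\gamma}\, \mathbb{E}_t \left\langle \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, \bar{z}_t - x^* \right\rangle \\
&= \frac{n}{2\gamma}\, \mathbb{E}_t \left[ \left\| \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) \right\|^2 + \left\| \bar{z}_t - x^* \right\|^2 - \left\| \bar{z}_t - x^* - \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) \right\|^2 \right] \\
&= \frac{n}{2\gamma}\, \mathbb{E}_t \left[ \left\| \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) \right\|^2 + \left\| \bar{z}_t - x^* \right\|^2 - \left\| \bar{z}_{t+1} - x^* \right\|^2 \right] \\
&\le \frac{n}{2\gamma}\, \mathbb{E}_t \left[ \gamma^2 G^2 + \frac{\gamma^2 \sigma^2}{n} + \left\| \bar{z}_t - x^* \right\|^2 - \left\| \bar{z}_{t+1} - x^* \right\|^2 \right].
\end{aligned}$$

Notice that for COL, we have $I_{1t} = 0$ because $x^{(i)}_t = \bar{z}_t$. So for DOL, in order to bound $I_{1t}$, we need to bound the difference $x^{(i)}_t - \bar{z}_t$ (using Lemma 8):

$$\begin{aligned}
\mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle
&= \mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla F_{i,t}(x^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle \\
&\le \mathbb{E}_t \sum_{i=1}^{n} \left( \alpha \left\| \nabla F_{i,t}(x^{(i)}_t) \right\|^2 + \frac{1}{\alpha} \left\| x^{(i)}_t - \bar{z}_t \right\|^2 \right).
\end{aligned}$$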
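Summing up the inequality above from $t = 1$ to $t = T$, we get

$$\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle
&= \sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla F_{i,t}(x^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle \\
&\le \sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \left( \alpha \left\| \nabla F_{i,t}(x^{(i)}_t) \right\|^2 + \frac{1}{\alpha} \left\| x^{(i)}_t - \bar{z}_t \right\|^2 \right) \\
&= \sum_{t=1}^{T} \left( \alpha\, \mathbb{E}_t \left\| \nabla F_t(X_t) \right\|_F^2 + \frac{1}{\alpha}\, \mathbb{E}_t \left\| X_t - \bar{z}_t \mathbf{1}^\top \right\|_F^2 \right) \\
&\le \alpha \sum_{t=1}^{T} \mathbb{E}_t \left\| \nabla F_t(X_t) \right\|_F^2 + \frac{4\gamma^2 C^2 q^2}{\alpha\, \delta_{\min}^2 (1-q)^2} \sum_{t=1}^{T} \mathbb{E}_t \left\| G_t \right\|_F^2 \\
&\le \alpha \sum_{t=1}^{T} \mathbb{E}_t \left\| \nabla F_t(X_t) \right\|_F^2 + \frac{4\gamma^2 C^2 q^2}{\alpha\, \delta_{\min}^2 (1-q)^2} \sum_{t=1}^{T} \left( \mathbb{E}_t \left\| \nabla F_t(X_t) \right\|_F^2 + n\sigma^2 \right).
\end{aligned}$$

Choosing $\alpha = \frac{2\gamma C q}{\delta_{\min}(1-q)}$, we have

$$\sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \left\langle \nabla f_{i,t}(x^{(i)}_t; \xi^{(i)}_t),\, x^{(i)}_t - \bar{z}_t \right\rangle \le \frac{8 n \gamma C T q G^2}{\delta_{\min}(1-q)} + \frac{2 n \gamma C q \sigma^2 T}{\delta_{\min}(1-q)}.$$

So we have

$$\begin{aligned}
\sum_{t=1}^{T} \left( \mathbb{E}_t \sum_{i=1}^{n} f_{i,t}(x^{(i)}_t; \xi^{(i)}_t) - nF_t(x^*) \right)
&\le \frac{8 n \gamma C T q G^2}{\delta_{\min}(1-q)} + \frac{2 n \gamma C q \sigma^2 T}{\delta_{\min}(1-q)} + \frac{n}{2\gamma} \sum_{t=1}^{T} \left( \gamma^2 G^2 + \frac{\gamma^2 \sigma^2}{n} + \mathbb{E}_t \left\| \bar{z}_t - x^* \right\|^2 - \mathbb{E}_t \left\| \bar{z}_{t+1} - x^* \right\|^2 \right) \\
&\le G^2 T n \gamma \left( \frac{8Cq}{\delta_{\min}(1-q)} + 1 \right) + \sigma^2 T \gamma \left( 1 + \frac{2nCq}{\delta_{\min}(1-q)} \right) + \frac{n}{2\gamma} \sum_{t=1}^{T} \left( \mathbb{E}_t \left\| \bar{z}_t - x^* \right\|^2 - \mathbb{E}_t \left\| \bar{z}_{t+1} - x^* \right\|^2 \right) \\
&\le G^2 T n \gamma \left( \frac{8Cq}{\delta_{\min}(1-q)} + 1 \right) + \sigma^2 T \gamma \left( 1 + \frac{2nCq}{\delta_{\min}(1-q)} \right) + \frac{nR^2}{2\gamma} \\
&= C_1 n G^2 T \gamma + (1 + nC_2)\sigma^2 T \gamma + \frac{nR^2}{2\gamma}.
\end{aligned}$$

Notice that Theorem 2 can be easily verified by setting $\gamma = \frac{\sqrt{n}\,R}{\left(\sqrt{(1+nC_2)\sigma^2} + \sqrt{nC_1 G^2}\right)\sqrt{T}}$.

Next, we present two lemmas for our proof of Lemma 8. The proofs of the following two lemmas can be found in the existing literature Nedić and Olshevsky (2014; 2016); Assran and Rabbat (2018); Assran et al. (2018).

Lemma 5. Under Assumption 1, there exists a constant $\delta_{\min} > 0$ such that for any $t$, the following holds:

$$\sum_{j=1}^{n} \left[ W_t W_{t-1} \cdots W_0 \right]_{ij} \ge \delta_{\min} \ge \frac{1}{n^n}, \quad \forall i,$$

where $W_t$ is a row stochastic matrix.

Lemma 6. Under Assumption 1, for any $t$, there always exist a stochastic vector $\psi(t)$ and two constants $C = 4$ and $q = 1 - n^{-n} < 1$ such that for any $s$ satisfying $s \le t$, the following inequality holds:

$$\left| \left[ W_t W_{t-1} \cdots W_{s+1} W_s \right]_{ij} - \psi_i(t) \right| \le C q^{t-s}, \quad \forall i, j,$$

where $W_t$ is a row stochastic matrix, and $\psi(t)$ is a vector with $\psi_i(t)$ being its i-th entry.

Lemma 7.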
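Given two non-negative sequences $\{a_t\}_{t=1}^{\infty}$ and $\{b_t\}_{t=1}^{\infty}$ satisfying $a_t = \sum_{s=1}^{t} \rho^{t-s} b_s$ with $\rho \in [0, 1)$, we have

$$D_k := \sum_{t=1}^{k} a_t^2 \le \frac{1}{(1-\rho)^2} \sum_{s=1}^{k} b_s^2.$$

Proof. From the definition, we have

$$S_k := \sum_{t=1}^{k} a_t = \sum_{t=1}^{k} \sum_{s=1}^{t} \rho^{t-s} b_s = \sum_{s=1}^{k} \sum_{t=s}^{k} \rho^{t-s} b_s = \sum_{s=1}^{k} \sum_{t=0}^{k-s} \rho^{t} b_s \le \sum_{s=1}^{k} \frac{b_s}{1-\rho},$$

$$\begin{aligned}
D_k &= \sum_{t=1}^{k} \left( \sum_{s=1}^{t} \rho^{t-s} b_s \right) \left( \sum_{r=1}^{t} \rho^{t-r} b_r \right)
= \sum_{t=1}^{k} \sum_{s=1}^{t} \sum_{r=1}^{t} \rho^{2t-s-r} b_s b_r \\
&\le \sum_{t=1}^{k} \sum_{s=1}^{t} \sum_{r=1}^{t} \rho^{2t-s-r}\, \frac{b_s^2 + b_r^2}{2}
= \sum_{t=1}^{k} \sum_{s=1}^{t} \sum_{r=1}^{t} \rho^{2t-s-r} b_s^2 \\
&\le \frac{1}{1-\rho} \sum_{t=1}^{k} \sum_{s=1}^{t} \rho^{t-s} b_s^2
\le \frac{1}{(1-\rho)^2} \sum_{s=1}^{k} b_s^2.
\end{aligned}$$

Based on the above three lemmas, we can obtain the following lemma.

Lemma 8. Under Assumption 1, the updating rule of Algorithm 1 leads to the following inequality:

$$\sum_{i=1}^{n} \sum_{t=0}^{T} \left\| x^{(i)}_{t+1} - \bar{z}_{t+1} \right\|_2^2 \le \frac{4\gamma^2 C^2 q^2}{\delta_{\min}^2 (1-q)^2} \sum_{s=0}^{T} \left\| G_s \right\|_F^2,$$

where $\gamma$ is the step size, and $C = 4$, $\delta_{\min} \ge n^{-n}$, $q = 1 - n^{-n}$ are constants. $G_s$ is the matrix of stochastic gradients at time $s$ (i.e., the i-th column is the stochastic gradient vector on node i at time $s$).

Proof. The updating rule of OPS can be formulated as

$$Z_{t+1} = (Z_t - \gamma G_t) W, \qquad \omega_{t+1} = W \omega_t, \qquad X_{t+1} = Z_{t+1} \left[ \mathrm{diag}(\omega_{t+1}) \right]^{-1},$$

where $W$ is a row stochastic matrix. $X_t = [x^{(1)}_t, x^{(2)}_t, \ldots, x^{(n)}_t]$ is the matrix whose i-th column is $x^{(i)}_t$. $G_t$ is the gradient matrix, whose i-th column is the stochastic gradient at $x^{(i)}_t$ on node i. $Z_t = [z^{(1)}_t, \ldots, z^{(n)}_t]$ is the matrix whose i-th column is $z^{(i)}_t$. Assuming $X_0 = O$ and $\omega_0 = \mathbf{1}$, we have

$$Z_{t+1} = (Z_t - \gamma G_t) W = \cdots = -\gamma \sum_{s=0}^{t} G_s W^{t-s+1}, \tag{8}$$

$$\bar{z}_{t+1} = \bar{z}_t - \gamma \bar{g}_t = \cdots = -\sum_{s=0}^{t} \gamma \bar{g}_s, \tag{9}$$

$$\omega_{t+1} = W^{t+1} \omega_0, \tag{10}$$

where $\bar{z}_t = Z_t \frac{\mathbf{1}}{n}$ is the average of all variables on the n nodes, and $\bar{g}_t = G_t \frac{\mathbf{1}}{n}$ is the averaged gradient. We have $W\mathbf{1} = \mathbf{1}$ since $W$ is a row stochastic matrix. For $\omega_{t+1}$, according to Lemma 6, we decompose it as follows:

$$\omega_{t+1} = W^{t+1} \omega_0 = \left[ W^{t+1} - \psi(t)\mathbf{1}^\top \right] \omega_0 + \psi(t)\mathbf{1}^\top \omega_0 = \left[ W^{t+1} - \psi(t)\mathbf{1}^\top \right] \mathbf{1} + n\psi(t), \tag{11}$$

since $\omega_0 = \mathbf{1}$.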
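On the other hand, according to Lemma 5, we also have

$$
\omega_{t+1}^{(i)} = \left[W^{t+1}\mathbf{1}\right]^\top e_i = \sum_{j=1}^{n} \left[W^{t+1}\right]_{ij} \ge n\delta_{\min}, \tag{12}
$$

where $e_i$ is the vector whose $i$-th entry is 1 and whose other entries are 0. We need to further bound the following term

$$
\begin{aligned}
\left\| x_{t+1}^{(i)} - \bar{z}_{t+1} \right\|
&= \left\| \frac{z_{t+1}^{(i)}}{\omega_{t+1}^{(i)}} - \bar{z}_{t+1} \right\|
= \gamma \left\| \sum_{s=0}^{t} \left( \frac{G_s W^{t-s+1} e_i}{\mathbf{1}^\top W^{t+1} e_i} - \frac{G_s \mathbf{1}}{n} \right) \right\| \\
&= \gamma \left\| \sum_{s=0}^{t} \frac{n G_s W^{t-s+1} e_i - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} e_i}{n\, \omega_{t+1}^{(i)}} \right\|,
\end{aligned}
$$

where the second equality is by (8), (9), and (10). We turn to bound the following term

$$
\left\| \sum_{s=0}^{t} \frac{n G_s W^{t-s+1} e_i - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} e_i}{n\, \omega_{t+1}^{(i)}} \right\|
\le \frac{1}{n^2 \delta_{\min}} \left\| \sum_{s=0}^{t} \left( n G_s W^{t-s+1} e_i - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} e_i \right) \right\|,
$$

where the inequality is according to (12). Therefore, combining the results above, we have

$$
\begin{aligned}
\sum_{i=1}^{n} \left\| x_{t+1}^{(i)} - \bar{z}_{t+1} \right\|_2^2
&\le \frac{\gamma^2}{n^4 \delta_{\min}^2} \sum_{i=1}^{n} \left\| \sum_{s=0}^{t} \left( n G_s W^{t-s+1} e_i - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} e_i \right) \right\|_2^2 \\
&\le \frac{\gamma^2}{n^4 \delta_{\min}^2} \left\| \sum_{s=0}^{t} \left( n G_s W^{t-s+1} - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} \right) \right\|_F^2,
\end{aligned}
$$

where the second inequality is due to $\sum_{i=1}^{n} \|A e_i\|_2^2 = \|A\|_F^2$. It remains to bound the following term

$$
\begin{aligned}
&\left\| \sum_{s=0}^{t} \left( n G_s W^{t-s+1} - G_s \mathbf{1}\mathbf{1}^\top W^{t+1} \right) \right\|_F^2 \\
&\quad= \left\| \sum_{s=0}^{t} \left( n G_s W^{t-s+1} - G_s \mathbf{1}\left[ \mathbf{1}^\top \big(W^{t+1} - \psi(t)\mathbf{1}^\top\big) + n\psi(t)^\top \right] \right) \right\|_F^2 \\
&\quad= \left\| \sum_{s=0}^{t} \left( n G_s \left[ W^{t-s+1} - \mathbf{1}\psi(t)^\top \right] - G_s \mathbf{1}\mathbf{1}^\top \left[ W^{t+1} - \mathbf{1}\psi(t)^\top \right] \right) \right\|_F^2 \\
&\quad\le \left( \left\| \sum_{s=0}^{t} n G_s \left[ W^{t-s+1} - \mathbf{1}\psi(t)^\top \right] \right\|_F + \left\| \sum_{s=0}^{t} G_s \mathbf{1}\mathbf{1}^\top \left[ W^{t+1} - \mathbf{1}\psi(t)^\top \right] \right\|_F \right)^2 \\
&\quad\le \left( n \sum_{s=0}^{t} \|G_s\|_F \left\| W^{t-s+1} - \mathbf{1}\psi(t)^\top \right\|_F + \sum_{s=0}^{t} \|G_s\|_F \left\| \mathbf{1}\mathbf{1}^\top \right\|_F \left\| W^{t+1} - \mathbf{1}\psi(t)^\top \right\|_F \right)^2 \\
&\quad\le n^2 \left( \sum_{s=0}^{t} \|G_s\|_F \left\| W^{t-s+1} - \mathbf{1}\psi(t)^\top \right\|_F + \sum_{s=0}^{t} \|G_s\|_F \left\| W^{t+1} - \mathbf{1}\psi(t)^\top \right\|_F \right)^2 \\
&\quad\le n^2 \left( \sum_{s=0}^{t} n C q^{t-s+1} \|G_s\|_F + \sum_{s=0}^{t} n C q^{t+1} \|G_s\|_F \right)^2 \\
&\quad\le 4 n^4 C^2 q^2 \left( \sum_{s=0}^{t} q^{t-s} \|G_s\|_F \right)^2,
\end{aligned}
$$

where the third inequality is due to $\|\mathbf{1}\mathbf{1}^\top\|_F = n$ and the fourth inequality is by Lemma 6 and the fact that $\|A\|_F \le n \cdot \max_{i,j} |A_{ij}|$ if $A \in \mathbb{R}^{n \times n}$. Therefore, combining all the above inequalities, we obtain

$$
\sum_{i=1}^{n} \left\| x_{t+1}^{(i)} - \bar{z}_{t+1} \right\|_2^2 \le \frac{4\gamma^2 C^2 q^2}{\delta_{\min}^2} \left( \sum_{s=0}^{t} q^{t-s} \|G_s\|_F \right)^2.
$$

Using Lemma 7, we have

$$
\sum_{t=0}^{T} \left( \sum_{s=0}^{t} q^{t-s} \|G_s\|_F \right)^2 \le \frac{1}{(1-q)^2} \sum_{t=0}^{T} \|G_t\|_F^2,
$$

which leads to

$$
\sum_{t=0}^{T} \sum_{i=1}^{n} \left\| x_{t+1}^{(i)} - \bar{z}_{t+1} \right\|_2^2 \le \frac{4\gamma^2 C^2 q^2}{\delta_{\min}^2 (1-q)^2} \sum_{t=0}^{T} \|G_t\|_F^2,
$$

which completes the proof. In fact, Theorem 3 follows as a corollary of Lemma 8 by setting $\gamma$ to an appropriate value.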
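The matrix-form update used in the proof of Lemma 8 can be illustrated with a small numerical sketch. The NumPy code below is not the authors' implementation: the 4-node mixing matrix, the function names, and the random data are hypothetical, and the weight update is taken in the transposed form $\omega_{t+1} = W^\top \omega_t$, which is an assumption about the convention that matches right-multiplication $Z_t W$. It checks two properties: the column average obeys $\bar{z}_{t+1} = \bar{z}_t - \gamma \bar{g}_t$, and with zero gradients the de-biased models $z^{(i)}/\omega^{(i)}$ agree on the initial average even though $W$ is not doubly stochastic.

```python
import numpy as np


def ops_step(Z, omega, G, W, gamma):
    """One push-sum style step in matrix form (illustrative sketch).

    Z:     (d, n) array, column i is node i's intermediate model z^(i)
    omega: (n,) array of push-sum weights
    G:     (d, n) array, column i is node i's (stochastic) gradient
    W:     (n, n) row-stochastic matrix; W[j, i] is the share node j sends to node i
    """
    Z_next = (Z - gamma * G) @ W   # local gradient step, then push to out-neighbours
    omega_next = W.T @ omega       # assumption: weights travel along the same edges
    X_next = Z_next / omega_next   # de-bias: column i is divided by omega_next[i]
    return Z_next, omega_next, X_next


def example_mixing_matrix():
    """Hypothetical graph: directed ring 0->1->2->3->0 with self-loops, plus edge 0->2."""
    return np.array([
        [1 / 3, 1 / 3, 1 / 3, 0.0],
        [0.0, 1 / 2, 1 / 2, 0.0],
        [0.0, 0.0, 1 / 2, 1 / 2],
        [1 / 2, 0.0, 0.0, 1 / 2],
    ])


rng = np.random.default_rng(0)
W = example_mixing_matrix()
d, n = 3, W.shape[0]
gamma = 0.1

# (a) The column average follows z_bar_{t+1} = z_bar_t - gamma * g_bar_t for any G,
#     because W @ 1 = 1 for a row-stochastic W.
Z0 = rng.normal(size=(d, n))
G0 = rng.normal(size=(d, n))
Z1, omega1, X1 = ops_step(Z0, np.ones(n), G0, W, gamma)
avg_gap = np.abs(Z1.mean(axis=1) - (Z0.mean(axis=1) - gamma * G0.mean(axis=1))).max()

# (b) With zero gradients, the de-biased models reach consensus on the initial average.
Z, omega = Z0.copy(), np.ones(n)
for _ in range(200):
    Z, omega, X = ops_step(Z, omega, np.zeros((d, n)), W, gamma)
consensus_gap = np.abs(X - Z0.mean(axis=1, keepdims=True)).max()

print(avg_gap, consensus_gap)
```

In this sketch the weights `omega` drift away from the all-ones vector because the column sums of `W` differ from 1, and dividing by them is what removes the resulting bias from the columns of `Z`.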

B EXTRA EXPERIMENT RESULTS

B.1 EVALUATION ON Room Occupancy DATASET

Due to space limitations, Sections 5.3 and 5.4 present experimental results only on the SUSY dataset. The corresponding results on Room Occupancy are shown in Figure 4 and Figure 5. In Figure 4, we vary the number of clients in the network from 6 to 20. In Figure 5, we vary the network density. All the results are consistent with those on SUSY.

We also run experiments with different ratios of the adversarial and stochastic components, based on the settings in Figure 2. As Figure 6 shows, communication does help reduce regret empirically. Moreover, as the ratio of the stochastic component increases, the regret of OPS decreases further. This empirically supports the claim that the stochastic component benefits from communication while the adversarial component does not.



https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#SUSY
https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

CONCLUSIONS

Decentralized federated learning with single-sided trust is a promising framework for solving a wide range of problems. In this paper, the online push-sum algorithm is developed for this setting, which is able to handle complex network topology and is proven to have an optimal convergence rate. The regret-based online problem formulation also extends its applications. We tested the proposed OPS algorithm in various experiments, which have empirically justified its efficiency.



Figure 1: Different types of architectures.

[n], edges : E) with vertex set $[n] = \{1, 2, \ldots, n\}$ and edge set $E \subset [n] \times [n]$. If there exists an edge $(u, v) \in E$, then node $u$ and node $v$ have a network connection and $u$ can directly send messages to $v$.

(i) t to obtain the loss function, based on which an intermediate local model z

Figure 3: Evaluation on different network sizes and densities

Figure 4: Evaluation on the Network Sizes

Figure 6: Comparison between OPS and Local OGD.

