MULTIGRAPH TOPOLOGY DESIGN FOR CROSS-SILO FEDERATED LEARNING

Anonymous

Abstract

Cross-silo federated learning utilizes a few hundred reliable data silos with high-speed access links to jointly train a model. While this setting has become popular in federated learning, designing a robust topology that reduces the training time remains an open problem. In this paper, we present a new multigraph topology for cross-silo federated learning. We first construct the multigraph from the overlay graph. We then parse this multigraph into different simple graphs with isolated nodes. The existence of isolated nodes allows us to perform model aggregation without waiting for those nodes, hence reducing the training time. We further propose a new distributed learning algorithm to use with our multigraph topology. Extensive experiments on public datasets show that our proposed method significantly reduces the training time compared with recent state-of-the-art topologies while ensuring convergence and maintaining accuracy.
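To make the parsing step concrete, the sketch below splits a multigraph into per-round simple graphs. This is a minimal illustration under assumptions, not the paper's exact algorithm: we assume the multigraph is represented as a map from edges to multiplicities, and we use a simplified activation rule in which an edge with multiplicity m participates only every m-th round. Nodes incident to no active edge in a round are isolated and can aggregate without waiting.

```python
def parse_multigraph(nodes, edge_multiplicity, num_rounds):
    """Split a multigraph into per-round simple graphs.

    Hypothetical scheme: an edge with multiplicity m is active every
    m-th round; nodes touching no active edge are isolated that round.
    """
    schedule = []
    for t in range(num_rounds):
        # Edges active in round t under the assumed multiplicity rule.
        active = [e for e, m in edge_multiplicity.items() if t % m == 0]
        # Nodes incident to at least one active edge must synchronize.
        touched = {v for e in active for v in e}
        # Isolated nodes skip waiting and proceed independently.
        isolated = set(nodes) - touched
        schedule.append((active, isolated))
    return schedule

# Toy overlay with four silos; multiplicities are illustrative only.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): 1, ("B", "C"): 2, ("C", "D"): 3}
schedule = parse_multigraph(nodes, edges, 3)
# Round 0 activates every edge (no isolated nodes); in later rounds
# higher-multiplicity edges drop out, leaving some silos isolated.
```

In this toy run, silo D communicates only every third round, so in the intervening rounds it is isolated and model aggregation among the other silos proceeds without it.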



Introduction

In practice, federated learning is a promising research direction that lets us utilize the effectiveness of machine learning methods while respecting user privacy. Key challenges in federated learning include model convergence, communication congestion, and imbalanced data distributions across silos (Kairouz et al., 2019). A popular federated training approach designates a central node that orchestrates the training process and aggregates the contributions of all clients. The main limitation of this client-server approach is that the server node can become a communication bottleneck, especially when the number of clients is large. To overcome this limitation, recent research has investigated decentralized (or peer-to-peer) federated learning, in which communication occurs over a peer-to-peer topology without the need for a central node. However, the main challenge of decentralized federated learning is to achieve fast training time while assuring model convergence and maintaining model accuracy.



Figure 1: Comparison between different topologies on the FEMNIST dataset and the Exodus network (Miller et al., 2010). The accuracy and total wall-clock training time (or overhead time) are reported after 6,400 communication rounds.

