CROSS-NODE FEDERATED GRAPH NEURAL NETWORK FOR SPATIO-TEMPORAL DATA MODELING

Anonymous authors
Paper under double-blind review

Abstract

The vast amount of data generated by networks of sensors, wearables, and Internet of Things (IoT) devices underscores the need for advanced modeling techniques that leverage the spatio-temporal structure of decentralized data, a setting imposed by the need for edge computation and by licensing (data access) issues. While federated learning (FL) has emerged as a framework for model training without requiring direct data sharing and exchange, effectively modeling the complex spatio-temporal dependencies to improve forecasting capabilities remains an open problem. On the other hand, state-of-the-art spatio-temporal forecasting models assume unfettered access to the data, neglecting constraints on data sharing. To bridge this gap, we propose a federated spatio-temporal model, Cross-Node Federated Graph Neural Network (CNFGNN), which explicitly encodes the underlying graph structure using a graph neural network (GNN)-based architecture under the constraint of cross-node federated learning, which requires that data in a network of nodes is generated locally on each node and remains decentralized. CNFGNN operates by disentangling the modeling of temporal dynamics on devices from the modeling of spatial dynamics on the server, and utilizes alternating optimization to reduce the communication cost and facilitate computation on the edge devices. Experiments on the traffic flow forecasting task show that CNFGNN achieves the best forecasting performance in both transductive and inductive learning settings with no extra computation cost on edge devices, while incurring modest communication cost.

1. INTRODUCTION

Modeling the dynamics of spatio-temporal data generated by networks of edge devices or nodes (e.g. sensors, wearable devices and Internet of Things (IoT) devices) is critical for various applications including traffic flow prediction (Li et al., 2018; Yu et al., 2018), forecasting (Seo et al., 2019; Azencot et al., 2020), and user activity detection (Yan et al., 2018; Liu et al., 2020). While existing works on spatio-temporal dynamics modeling (Battaglia et al., 2016; Kipf et al., 2018; Battaglia et al., 2018) assume that the model is trained with centralized data gathered from all devices, the volume of data generated at these edge devices precludes such centralized data processing and calls for decentralized processing, where computation on the edge can lead to significant gains in latency. In addition, in the case of spatio-temporal forecasting, edge devices need to leverage the complex inter-dependencies among nodes to improve prediction performance. Moreover, with increasing concerns about data privacy and access restrictions imposed by existing licensing agreements, it is critical for spatio-temporal modeling to utilize decentralized data while still leveraging the underlying relationships for improved performance. Although recent works in federated learning (FL) (Kairouz et al., 2019) provide a solution for training a model with decentralized data on multiple devices, they either do not consider the inherent spatio-temporal dependencies (McMahan et al., 2017; Li et al., 2020b; Karimireddy et al., 2020) or model them only implicitly by imposing the graph structure as a regularizer on model weights (Smith et al., 2017). The latter suffers from the limitation of regularization-based methods, namely the assumption that graphs encode only the similarity of nodes (Kipf & Welling, 2017), and cannot operate in settings where only a fraction of devices are observed during training (the inductive learning setting).
As a result, there is a need for an architecture for spatio-temporal data modeling that enables reliable computation on the edge while keeping the data decentralized. To this end, leveraging recent works on federated learning (Kairouz et al., 2019), we introduce the cross-node federated learning requirement to ensure that data generated locally at a node remains decentralized. Specifically, our architecture, Cross-Node Federated Graph Neural Network (CNFGNN), aims to effectively model the complex spatio-temporal dependencies under the cross-node federated learning constraint. To do so, CNFGNN decomposes the modeling of temporal and spatial dependencies using an encoder-decoder model on each device to extract temporal features from local data, and a Graph Neural Network (GNN) based model on the server to capture spatial dependencies among devices. Compared to existing federated learning techniques that rely on regularization to incorporate spatial relationships, CNFGNN leverages an explicit graph structure using a GNN-based architecture, which leads to performance gains. However, the federated learning (data sharing) constraint means that the GNN cannot be trained in a centralized manner, since each node can only access the data stored on itself. To address this, CNFGNN employs Split Learning (Singh et al., 2019) to train the spatial and temporal modules. Further, to alleviate the high communication cost incurred by Split Learning, we propose an alternating-optimization-based training procedure for these modules, which incurs only half the communication overhead of a comparable Split Learning architecture. We also use Federated Averaging (FedAvg) (McMahan et al., 2017) to train a shared temporal feature extractor for all nodes, which leads to improved empirical performance. Our main contributions are as follows:
1. We propose Cross-Node Federated Graph Neural Network (CNFGNN), a GNN-based federated learning architecture that captures complex spatio-temporal relationships among multiple nodes while ensuring that locally generated data remains decentralized, at no extra computation cost on the edge devices.
2. Our modeling and training procedure enables GNN-based architectures to be used in federated learning settings. We achieve this by disentangling the modeling of local temporal dynamics on edge devices from the modeling of spatial dynamics on the central server, and leverage an alternating-optimization-based procedure that updates the spatial and temporal modules using Split Learning and Federated Averaging to enable effective GNN-based federated learning.
3. We demonstrate that CNFGNN achieves the best prediction performance (in both transductive and inductive settings) at no extra computation cost on edge devices and with modest communication cost, compared to related techniques on a traffic flow prediction task.

2. RELATED WORK

Our method derives elements from graph neural networks, federated learning and privacy-preserving graph learning; we now discuss related works in these areas in relation to our work. Graph Neural Networks (GNNs). GNNs have shown superior performance on various learning tasks with graph-structured data, including graph embedding (Hamilton et al., 2017), node classification (Kipf & Welling, 2017), spatio-temporal data modeling (Yan et al., 2018; Li et al., 2018; Yu et al., 2018) and multi-agent trajectory prediction (Battaglia et al., 2016; Kipf et al., 2018; Li et al., 2020a). Recent GNN models (Hamilton et al., 2017; Ying et al., 2018; You et al., 2019; Huang et al., 2018) also incorporate sampling strategies and are able to scale to large graphs. While GNNs enjoy the benefit of a strong inductive bias (Battaglia et al., 2018; Xu et al., 2019), most works require centralized data during the training and inference processes. Federated Learning (FL). Federated learning is a machine learning setting where multiple clients train a model in collaboration with decentralized training data (Kairouz et al., 2019). It requires that the raw data of each client is stored locally without any exchange or transfer. However, decentralized training comes at the cost of reduced data utility, due to the heterogeneous data distributions across clients and the lack of information exchange among them. Various optimization algorithms have been developed for federated learning on non-IID and unbalanced data (McMahan et al., 2017; Li et al., 2020b; Karimireddy et al., 2020). Smith et al. (2017) propose a multi-task learning framework that captures relationships amongst data. While the above works mitigate the lack of neighbors' information to some extent, they are not as effective as GNN models and still suffer from the absence of feature exchange and aggregation. Alternating Optimization.
Alternating optimization is a popular choice in non-convex optimization (Agarwal et al., 2014; Arora et al., 2014; 2015; Jain & Kar, 2017). In the context of federated learning, Liang et al. (2020) use alternating optimization to learn a simple global model and reduce the number of communicated parameters, and He et al. (2020) use alternating optimization for knowledge distillation from server models to edge models. In our work, we utilize alternating optimization to effectively train the on-device modules and the server module jointly, which capture temporal and spatial relationships respectively. Privacy-Preserving Graph Learning. Recent work (2020) preprocesses the input raw data with differential privacy (DP) before feeding it into a GNN model. Composing privacy-preserving techniques for graph learning can help build federated learning systems following the privacy-in-depth principle, wherein the privacy properties degrade as gracefully as possible if one technique fails (Kairouz et al., 2019).

3.1. PROBLEM FORMULATION

Given a dataset with a graph G = (V, E), a feature tensor X ∈ R^{|V|×...} and a label tensor Y ∈ R^{|V|×...}, we consider learning a model under the cross-node federated learning constraint: the node feature x_i = X_{i,...}, the node label y_i = Y_{i,...}, and the model output ŷ_i are visible only to node i. One typical task that requires the cross-node federated learning constraint is the prediction of spatio-temporal data generated by a network of sensors. In such a scenario, V is the set of sensors and E describes relations among sensors (e.g. e_ij ∈ E if and only if the distance between v_i and v_j is below some threshold). The feature tensor x_i ∈ R^{m×D} represents the i-th sensor's records in the D-dimensional space during the past m time steps, and the label y_i ∈ R^{n×D} represents the i-th sensor's records in the future n time steps. Since records collected on different sensors owned by different users/organizations may not be shareable due to the need for edge computation or licensing issues on data access, it is necessary to design an algorithm that models the spatio-temporal relations without any direct exchange of node-level data.
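As a concrete illustration of this setup, the sliding-window construction of local training pairs can be sketched as follows (a pure-Python sketch with an illustrative function name; each node would run this on its own records only, so the raw series never leaves the node):

```python
def make_windows(series, m, n):
    """Split one node's local record sequence into (x_i, y_i) pairs:
    x_i holds the past m steps, y_i the following n steps. The raw
    series stays on the node; only model states are ever communicated."""
    pairs = []
    for t in range(len(series) - m - n + 1):
        x_i = series[t:t + m]          # past m steps (input)
        y_i = series[t + m:t + m + n]  # future n steps (label)
        pairs.append((x_i, y_i))
    return pairs
```

With m = n = 12 and a length-24 window, each truncated sequence yields exactly one training pair, matching the forecasting setup used in the experiments.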

3.2. PROPOSED METHOD

We now introduce our proposed Cross-Node Federated Graph Neural Network (CNFGNN) model. We begin by disentangling the modeling of node-level temporal dynamics and server-level spatial dynamics as follows: (i) (Figure 1c) on each node, an encoder-decoder model extracts temporal features from the data on the node and makes predictions; (ii) (Figure 1b) on the central server, a Graph Network (GN) (Battaglia et al., 2018) propagates the extracted node temporal features and outputs node embeddings, which incorporate the relational information among nodes. Step (i) accesses only the node's own data, which cannot be shared, and is executed locally on each node. Step (ii) involves only the upload and download of smashed features and gradients rather than the raw data on nodes. This decomposition enables the exchange and aggregation of node information under the cross-node federated learning constraint.

3.2.1. MODELING OF NODE-LEVEL TEMPORAL DYNAMICS

We modify the Gated Recurrent Unit (GRU) based encoder-decoder architecture in (Cho et al., 2014) to model node-level temporal dynamics on each node. Given an input sequence x_i ∈ R^{m×D} on the i-th node, an encoder sequentially reads the whole sequence and outputs the hidden state h_{c,i} as the summary of the input sequence according to Equation 1:

h_{c,i} = Encoder_i(x_i, h^{(0)}_{c,i}), (1)

where h^{(0)}_{c,i} is a zero-valued initial hidden state vector. To incorporate the spatial dynamics into the prediction model of each node, we concatenate h_{c,i} with the node embedding h_{G,c,i} generated by the procedure described in Section 3.2.2, which contains spatial information, and use the result as the initial state vector of the decoder. The decoder generates the prediction ŷ_i auto-regressively, starting from the last frame x_{i,m} of the input sequence, with the concatenated hidden state vector:

ŷ_i = Decoder_i(x_{i,m}, [h_{c,i}; h_{G,c,i}]). (2)

We choose the mean squared error (MSE) between the prediction and the ground truth values as the loss function, which is evaluated on each node locally.
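To make the data flow concrete, here is a deliberately simplified scalar sketch of this encoder-decoder step (a single-unit GRU toy with hand-rolled gates; CNFGNN itself uses vector-valued GRU layers, and the concatenation [h_{c,i}; h_{G,c,i}] is approximated by a sum in this scalar toy):

```python
import math

def gru_cell(x, h, W):
    """Minimal single-unit GRU cell (scalar input and state), a toy
    stand-in for Encoder_i/Decoder_i; W packs six scalar weights."""
    wz, uz, wr, ur, wh, uh = W
    z = 1 / (1 + math.exp(-(wz * x + uz * h)))   # update gate
    r = 1 / (1 + math.exp(-(wr * x + ur * h)))   # reset gate
    h_tilde = math.tanh(wh * x + uh * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde

def encode(x_seq, W):
    """Encoder_i: read the whole input sequence, return the summary h_c,i."""
    h = 0.0  # zero-valued initial hidden state h^(0)_c,i
    for x in x_seq:
        h = gru_cell(x, h, W)
    return h

def decode(x_last, h_c, h_g, W, W_out, n):
    """Decoder_i: start from the last input frame x_{i,m} with the combined
    state (sum stands in for concatenation in this scalar toy) and predict
    n future steps auto-regressively."""
    h = h_c + h_g
    y, preds = x_last, []
    for _ in range(n):
        h = gru_cell(y, h, W)
        y = W_out * h  # readout back to the observation space
        preds.append(y)
    return preds
```

The point of the sketch is the control flow: the encoder summary and the server-provided graph embedding jointly initialize the decoder, whose own output is fed back as the next input.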

3.2.2. MODELING OF SPATIAL DYNAMICS

To capture the complex spatial dynamics, we adopt Graph Networks (GNs) proposed in (Battaglia et al., 2018) to generate node embeddings containing the relational information of all nodes. The central server collects the hidden states from all nodes {h_{c,i} | i ∈ V} as the input to the GN. Each layer of the GN updates the input features as follows:

e'_k = φ^e(e_k, v_{r_k}, v_{s_k}, u),   ē'_i = ρ^{e→v}(E'_i),   v'_i = φ^v(ē'_i, v_i, u),
ē' = ρ^{e→u}(E'),   v̄' = ρ^{v→u}(V'),   u' = φ^u(ē', v̄', u),

Algorithm 1 Training algorithm of CNFGNN on the server side.

Server executes:

Initialize the server-side GN weights θ_GN and the shared client model weights θ_c = {θ_c^enc, θ_c^dec}.
for each node i ∈ V in parallel do
    Initialize the client model θ_{c,i} ← θ_c.
    Initialize the graph encoding on the node h_{G,c,i} ← h^{(0)}_{G,c,i}.
end for
for global round r_g = 1, 2, ..., R_g do
    // (1) Federated learning of on-node models.
    for each client i ∈ V in parallel do
        θ_{c,i} ← ClientUpdate(i).
    end for
    θ_c ← Σ_{i∈V} (n_i / n) θ_{c,i}; broadcast θ_c to all clients.   // FedAvg aggregation
    // (2) Temporal encoding update.
    for each client i ∈ V in parallel do
        h_{c,i} ← ClientEncode(i).
    end for
    // (3) Split Learning of GN.
    for server round r_s = 1, 2, ..., R_s do
        {h_{G,c,i} | i ∈ V} ← GN({h_{c,i} | i ∈ V}; θ_GN).
        for each client i ∈ V in parallel do
            ∇_{h_{G,c,i}} ℓ_i ← ClientBackward(i, h_{G,c,i}).
            ∇_{θ_GN} ℓ_i ← backpropagate ∇_{h_{G,c,i}} ℓ_i through the GN.
        end for
        θ_GN ← θ_GN − η_s Σ_{i∈V} ∇_{θ_GN} ℓ_i.
    end for
    // (4) On-node graph embedding update.
    {h_{G,c,i} | i ∈ V} ← GN({h_{c,i} | i ∈ V}; θ_GN).
    for each client i ∈ V in parallel do
        Set the graph encoding on client i to h_{G,c,i}.
    end for
end for

ClientUpdate(i):

for client round r_c = 1, 2, ..., R_c do
    h_{c,i} ← Encoder_i(x_i; θ_{c,i}^enc).
    ŷ_i ← Decoder_i(x_{i,m}, [h_{c,i}; h_{G,c,i}]; θ_{c,i}^dec).
    ℓ_i ← ℓ(ŷ_i, y_i).
    θ_{c,i} ← θ_{c,i} − η_c ∇_{θ_{c,i}} ℓ_i.
end for
return θ_{c,i} to the server.

ClientEncode(i):

return h_{c,i} = Encoder_i(x_i; θ_{c,i}^enc) to the server.

ClientBackward(i, h_{G,c,i}):

ŷ_i ← Decoder_i(x_{i,m}, [h_{c,i}; h_{G,c,i}]; θ_{c,i}^dec).
ℓ_i ← ℓ(ŷ_i, y_i).
return ∇_{h_{G,c,i}} ℓ_i to the server.

In the GN update equations above, e_k, v_i and u denote the edge, node and global features respectively; φ^e, φ^v, φ^u are neural networks, and ρ^{e→v}, ρ^{e→u}, ρ^{v→u} are aggregation functions such as summation. As shown in Figure 1b, we choose a 2-layer GN with residual connections for all experiments. As the input to the first GN layer we set v_i = h_{c,i} and e_k = W_{r_k,s_k} (where W is the adjacency matrix), and assign the empty vector to u. The server-side GN outputs embeddings {h_{G,c,i} | i ∈ V} for all nodes and sends each node its corresponding embedding.
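A minimal sketch of one GN layer with summation as every aggregator ρ; features are scalars here for brevity, and the update functions φ are plain callables standing in for the neural networks (CNFGNN stacks two such layers with residual connections and vector-valued features):

```python
def gn_layer(V_feats, E_feats, edges, u, phi_e, phi_v, phi_u):
    """One Graph Network layer (after Battaglia et al., 2018) with sum
    aggregation. edges[k] = (s_k, r_k) gives sender and receiver of edge k."""
    # Per-edge update: e'_k = phi_e(e_k, v_{r_k}, v_{s_k}, u).
    E_new = [phi_e(E_feats[k], V_feats[r], V_feats[s], u)
             for k, (s, r) in enumerate(edges)]
    # Per-node update: aggregate incoming updated edges, then apply phi_v.
    V_new = []
    for i, v in enumerate(V_feats):
        incoming = sum(E_new[k] for k, (s, r) in enumerate(edges) if r == i)
        V_new.append(phi_v(incoming, v, u))
    # Global update from aggregated edges and nodes.
    u_new = phi_u(sum(E_new), sum(V_new), u)
    return V_new, E_new, u_new
```

With identity-like callables one can trace exactly which features reach which node: every node only ever sees aggregated messages, never another node's raw data.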

3.2.3. ALTERNATING TRAINING OF NODE-LEVEL AND SPATIAL MODELS

One challenge brought about by the cross-node federated learning requirement and the server-side GN model is the high communication cost in the training stage. Since we distribute different parts of the model across different devices, Split Learning proposed by (Singh et al., 2019) is a potential solution; its communication cost and our alternating-training alternative are analyzed below. To more effectively extract temporal features from each node, we also train the encoder-decoder models on nodes with the FedAvg algorithm proposed in (McMahan et al., 2017). This enables all nodes to share the same feature extractor and thus a joint hidden space of temporal features, which avoids potential overfitting of the models on nodes and empirically demonstrates faster convergence and better prediction performance.
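The FedAvg aggregation used for the shared encoder-decoder weights can be sketched as follows (a simplified version in which each client's parameters are stored as a dict of flat float lists; `fedavg` is an illustrative name):

```python
def fedavg(client_weights, client_sizes):
    """Weighted FedAvg aggregation (after McMahan et al., 2017): average
    each parameter across clients, weighting client i by n_i / n, where
    n_i is the client's number of local samples and n their sum."""
    n = sum(client_sizes)
    avg = {}
    for key in client_weights[0]:
        dim = len(client_weights[0][key])
        avg[key] = [
            sum(w[key][d] * s / n for w, s in zip(client_weights, client_sizes))
            for d in range(dim)
        ]
    return avg
```

The server broadcasts the averaged weights back to all nodes, so every node starts the next round from the same temporal feature extractor.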

4. EXPERIMENTS

We evaluate the performance of CNFGNN and all baseline methods on the traffic forecasting task, an important application of spatio-temporal data modeling. We reuse the following two real-world large-scale datasets from (Li et al., 2018) and follow the same preprocessing procedures: (1) PEMS-BAY: traffic speed readings from 325 sensors in the Bay Area over the 6 months from Jan 1st, 2017 to May 31st, 2017. (2) METR-LA: traffic speed readings from 207 loop detectors installed on the highways of Los Angeles County over the 4 months from Mar 1st, 2012 to Jun 30th, 2012. For both datasets, we construct the adjacency matrix of sensors using a Gaussian kernel with a threshold: W_{i,j} = d_{i,j} if d_{i,j} ≥ κ else 0, where d_{i,j} = exp(−dist(v_i, v_j)² / σ²), dist(v_i, v_j) is the road network distance from sensor v_i to sensor v_j, σ is the standard deviation of the distances and κ is the threshold. We set κ = 0.1 for both datasets. We aggregate the traffic speed readings in both datasets into 5-minute windows and split the whole sequence into subsequences of length 24. The forecasting task is to predict the traffic speed in the following 12 steps of each sequence given the first 12 steps. We show the statistics of both datasets in Table 1.
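The thresholded Gaussian-kernel adjacency construction above can be sketched as follows (illustrative function name; `dists` holds the pairwise road-network distances):

```python
import math

def gaussian_adjacency(dists, sigma, kappa=0.1):
    """Thresholded Gaussian-kernel adjacency: W[i][j] = exp(-dist^2 / sigma^2)
    if that value is at least kappa, else 0 (no edge). dists[i][j] is the
    road-network distance from sensor i to sensor j; sigma is the standard
    deviation of the distances."""
    W = []
    for row in dists:
        new_row = []
        for d in row:
            w = math.exp(-(d / sigma) ** 2)
            new_row.append(w if w >= kappa else 0.0)
        W.append(new_row)
    return W
```

The threshold κ sparsifies the graph: distant sensor pairs whose kernel value falls below κ get no edge at all, which keeps the GN's edge set manageable.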

4.1. SPATIO-TEMPORAL DATA MODELING: TRAFFIC FLOW FORECASTING

Baselines We compare CNFGNN with the following baselines. (1) GRU (centralized): a Gated Recurrent Unit (GRU) model trained with centralized sensor data. (2) GRU + GN (centralized): a model directly combining GRU and GN, trained with centralized data, whose architecture is similar to CNFGNN except that all GRU modules on nodes always share the same weights; we regard its performance as an upper bound for CNFGNN. (3) GRU (local): for each node we train a GRU model with only its local data. (4) GRU + FedAvg: a GRU model trained with the Federated Averaging algorithm (McMahan et al., 2017). (5) GRU + FMTL: for each node we train a GRU model using federated multi-task learning (FMTL) with the cluster regularization (Smith et al., 2017) given by the adjacency matrix. For each baseline, we evaluate 2 variants of the GRU model to show the effect of on-device model complexity: one with 63K parameters and the other with 727K parameters. For CNFGNN, the encoder-decoder model on each node has 64K parameters and the GN model has 1M parameters.

Table 3: Comparison of the computation cost on edge devices and the communication cost. We use the number of floating point operations (FLOPS) to measure the computation cost of models on edge devices, and report the total size of data/parameters transmitted in the training stage (Train Comm Cost).

Secondly, both the GRU + FMTL baseline and CNFGNN consider the spatial relations among nodes and show better forecasting performance than the baselines without relation information, which shows that modeling spatial dependencies is critical for the forecasting task. Lastly, CNFGNN achieves the lowest forecasting error on both datasets. The baseline that increases the complexity of on-device models (GRU (727K) + FMTL) gains slight or even no improvement, at the cost of higher computation on edge devices and larger communication cost.
In contrast, owing to its effective modeling of spatial dependencies in the data, CNFGNN not only achieves the largest improvement in forecasting performance, but also keeps the computation cost on devices almost unchanged and maintains a modest communication cost compared to the baselines that increase on-device model complexity.

4.2. INDUCTIVE LEARNING ON UNSEEN NODES

Set-up Another advantage of CNFGNN is that it can conduct inductive learning and generalize to larger graphs with nodes unobserved during the training stage. We evaluate CNFGNN under the following inductive learning setting: for each dataset, we first sort all sensors by longitude, then use the subgraph on the first η% of sensors to train the model and evaluate the trained model on the entire graph. For each dataset we select η% = 25%, 50%, 75%. Among the baselines following the cross-node federated learning constraint, GRU (local) and GRU + FMTL require training new models on unseen nodes, so only GRU + FedAvg is applicable to the inductive learning setting. Discussion Table 4 shows the inductive learning performance of CNFGNN and the GRU + FedAvg baseline on both datasets. We observe that under most settings CNFGNN outperforms the GRU + FedAvg baseline (except on the METR-LA dataset with 25% of nodes observed in training, where both models perform similarly), showing that CNFGNN generalizes better.

4.3. ABLATION STUDY: TRAINING STRATEGIES

We compare the following training strategies: (2) Split Learning (SL): CNFGNN trained with split learning (Singh et al., 2019), where the models on nodes and the model on the server are jointly trained by exchanging hidden vectors and gradients. Table 5 shows their prediction performance and the communication cost in training. We notice that: (1) SL suffers from suboptimal prediction performance and high communication cost on both datasets; SL + FedAvg is inconsistent across the two datasets and its performance is always inferior to AT + FedAvg, while AT + FedAvg consistently outperforms the other strategies on both datasets, including its variant without FedAvg. (2) AT + FedAvg has the lowest communication cost on METR-LA and the second lowest on PEMS-BAY, where the strategy with the lowest communication cost (SL + FedAvg) has a much higher prediction error (4.383 vs. 3.822).
Both observations illustrate that our proposed training strategy, AT + FedAvg, achieves the best prediction performance as well as a low communication cost compared to the other training strategies.

4.4. ABLATION STUDY: EFFECT OF CLIENT ROUNDS AND SERVER ROUNDS

Set-up We further investigate the effect of different combinations of the number of client rounds (R_c) in Algorithm 2 and the number of server rounds (R_s) in Algorithm 1. To this end, we vary both R_c and R_s over {1, 10, 20}. Discussion Figure 3 shows the forecasting performance (measured with RMSE) and the total communication cost during training of CNFGNN under all combinations of (R_c, R_s) on the METR-LA dataset. We observe that: (1) Models with lower R_c/R_s ratios (R_c/R_s < 0.5) tend to have lower forecasting errors, while models with higher R_c/R_s ratios (R_c/R_s > 2) have lower communication cost in training. This is because a lower R_c/R_s ratio encourages more frequent exchange of node information at the expense of higher communication cost, while a higher ratio acts in the opposite way. (2) Models with similar R_c/R_s ratios have similar communication costs, and among them those with lower R_c values perform better, corroborating the observation in (1) that frequent node information exchange improves forecasting performance.

5. CONCLUSION

We propose Cross-Node Federated Graph Neural Network (CNFGNN), which bridges the gap between modeling complex spatio-temporal data and decentralized data processing by enabling the use of graph neural networks (GNNs) in the federated learning setting. We accomplish this by decoupling the learning of local temporal models from the server-side spatial model, using alternating optimization of the spatial and temporal modules based on split learning and federated averaging. Our experimental results on traffic flow prediction on two real-world datasets show superior performance compared to competing techniques. Our future work includes applying existing GNN models with sampling strategies and integrating them into CNFGNN for large-scale graphs, extending CNFGNN to a fully decentralized framework, and incorporating existing privacy-preserving methods for graph learning into CNFGNN, to enhance federated learning of spatio-temporal dynamics.




Figure 1: Cross-Node Federated Graph Neural Network. (a) In each round of training, we alternately train models on nodes and the model on the server. More specifically, we sequentially execute: (1) Federated learning of on-node models. (2) Temporal encoding update. (3) Split Learning of GN. (4) On-node graph embedding update. (b) Detailed view of the server-side GN model for modeling spatial dependencies in data. (c) Detailed view of the encoder-decoder model on the i-th node.



Algorithm 2: Training algorithm of CNFGNN on the client side.

Figure 2: Validation loss during the training stage of different training strategies.

(3) Split Learning + FedAvg (SL + FedAvg): a variant of SL that periodically synchronizes the weights of the encoder-decoder modules with FedAvg. (4) Alternating training without Federated Averaging of the models on nodes (AT, w/o FedAvg). (5) Alternating training with Federated Averaging on nodes as described in Section 3.2.3 (AT + FedAvg). Discussion Figure 2 shows the validation loss during training of the different training strategies on the PEMS-BAY and METR-LA datasets.

Figure 3: Effect of client rounds and server rounds (R c , R s ) on forecasting performance and communication cost.

Figure A1: Visualization of subgraphs visible in training under different ratios.

Table 1: Statistics of the PEMS-BAY and METR-LA datasets.

In Split Learning, hidden vectors and gradients are communicated among devices. However, when we simply train the model end-to-end via Split Learning, in the forward propagation the central server needs to receive hidden states from all nodes and send node embeddings to all nodes, and in the backward propagation it must receive the gradients of the node embeddings from all nodes and send back the gradients of the hidden states to all nodes. Assuming all hidden states and node embeddings have the same size S, the total amount of data transmitted in each training round of the GN model is 4|V|S. To alleviate this high communication cost in the training stage, we instead alternately train the models on nodes and the GN model on the server. More specifically, in each round of training, we (1) fix the node embeddings h_{G,c,i} and optimize the encoder-decoder models for R_c rounds, then (2) optimize the GN model while fixing all models on nodes. Since the models on nodes are fixed, h_{c,i} stays constant during the training of the GN model, so the server only needs to fetch h_{c,i} from the nodes once before the training of the GN starts, and afterwards only communicates node embeddings and their gradients. Therefore, the average amount of data transmitted in each of the R_s training rounds of the GN model reduces to (2 + 2R_s)|V|S / R_s. We provide more details of the training procedure in Algorithm 1 and Algorithm 2.
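The two communication costs derived above can be compared with a small helper (illustrative function names; S denotes the common size of hidden states and node embeddings):

```python
def split_learning_cost(num_nodes, S, rounds):
    """Total data transmitted when training the GN end-to-end with Split
    Learning: each round moves hidden states up, embeddings down, embedding
    gradients up, and hidden-state gradients down, i.e. 4|V|S per round."""
    return 4 * num_nodes * S * rounds

def alternating_cost(num_nodes, S, rounds):
    """Total data transmitted by the alternating scheme for the same number
    of GN rounds: hidden states are fetched once, each round moves embeddings
    down and their gradients up, and the final embeddings are sent once,
    i.e. (2 + 2 * rounds)|V|S in total."""
    return (2 + 2 * rounds) * num_nodes * S
```

For R_s = 20, for instance, the alternating scheme transmits (2 + 40)|V|S against Split Learning's 80|V|S, slightly above half the cost, and the ratio approaches one half as R_s grows.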

Comparison of performance on the traffic flow forecasting task. We use the root mean squared error (RMSE) to evaluate the forecasting performance.

Inductive learning performance measured with root mean squared error (RMSE).

Comparison of test error (RMSE) and the communication cost during training of different training strategies of CNFGNN.

Parameters used for calculating the communication cost of CNFGNN (SL).

Parameters used for calculating the communication cost of CNFGNN (SL + FedAvg).

Parameters used for calculating the communication cost of CNFGNN (AT, w/o FedAvg).

Table A6: Inductive learning performance measured with root mean squared error (RMSE).

When the ratio of visible nodes in training is extremely low (5%), there is not enough spatial relationship information in the training data to train the GN module in CNFGNN, and the performance of CNFGNN may not be ideal. We visualize the subgraphs visible in training under different ratios in Figure A1. However, as long as the training data covers a moderate portion of the spatial information of the whole graph, CNFGNN can still leverage the learned spatial connections among nodes effectively and outperform GRU + FedAvg. We empirically show that the necessary ratio can vary across datasets (25% for PEMS-BAY and 50% for METR-LA).

A.4 THE HISTOGRAMS OF DATA ON DIFFERENT NODES

We show the histograms of traffic speed on different nodes of PEMS-BAY and METR-LA in Figure A2. For each dataset, we only show the first 100 nodes ranked by their IDs for simplicity. The histograms show that the data distribution varies across nodes; thus data on different nodes is not independent and identically distributed.

A APPENDIX

A.1 DETAILED EXPERIMENT SETTINGS

Unless noted otherwise, all models are optimized using the Adam optimizer with a learning rate of 1e-3.

GRU (centralized) The Gated Recurrent Unit (GRU) model trained with centralized sensor data. The GRU model with 63K parameters is a 1-layer GRU with hidden dimension 100, and the GRU model with 727K parameters is a 2-layer GRU with hidden dimension 200.

GRU (local)

We train one GRU model for each node with the local data only.

GRU + FedAvg

We train a single GRU model with Federated Averaging (McMahan et al., 2017) . We select 1 as the number of local epochs.

GRU + FMTL

We train one GRU model for each node using federated multi-task learning (FMTL) with the cluster regularization (Smith et al., 2017) given by the adjacency matrix. More specifically, the cluster regularization (without the L2-norm regularization term) takes the following form:

R(W, Ω) = λ tr(W Ω W^T). (A1)

Given the constructed adjacency matrix A, Ω = (1/|V|)(D − A) = (1/|V|)L, where D is the degree matrix and L is the graph Laplacian, so Equation A1 can be reformulated as:

R(W, Ω) = λ₁ Σ_{i∈V} Σ_{j≠i} α_{i,j} ⟨w_i, w_i − w_j⟩. (A2)

We implement the cluster regularization by sharing model weights between each pair of nodes connected by an edge and select λ₁ = 0.1.

CNFGNN We use a GRU-based encoder-decoder model as the model on nodes, which has 1 GRU layer and hidden dimension 64. We use a 2-layer Graph Network (GN) with residual connections as the graph neural network model on the server side. We use the same network architecture for the edge/node/global update function in each GN layer: a multi-layer perceptron (MLP) with 3 hidden layers, whose sizes are [256, 256, 128] respectively. We choose R_c = 1, R_s = 20 for experiments on PEMS-BAY, and R_c = 1, R_s = 1 for METR-LA.
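The equivalence between the trace form of Equation A1 and the pairwise form of Equation A2 (up to the λ/|V| factor) can be checked numerically; the sketch below assumes a symmetric adjacency A with zero diagonal and represents each node's weight vector w_i as a plain list:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def trace_form(Wcols, A):
    """tr(W L W^T) with L = D - A, where Wcols[i] is the weight vector of
    node i (a column of W); this is R(W, Omega) up to the lambda/|V| factor."""
    n = len(Wcols)
    total = 0.0
    for i in range(n):
        deg = sum(A[i])  # degree of node i (A has zero diagonal)
        total += deg * dot(Wcols[i], Wcols[i])
        for j in range(n):
            total -= A[i][j] * dot(Wcols[i], Wcols[j])
    return total

def pairwise_form(Wcols, A):
    """The reformulation: sum_i sum_{j != i} A_ij <w_i, w_i - w_j>,
    which only couples nodes joined by an edge."""
    n = len(Wcols)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if j != i:
                diff = [a - b for a, b in zip(Wcols[i], Wcols[j])]
                total += A[i][j] * dot(Wcols[i], diff)
    return total
```

The pairwise form is what makes the weight-sharing implementation possible: each gradient term touches only a node and its graph neighbors, so no global coordination is needed.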

A.2 CALCULATION OF COMMUNICATION COST

We denote by R the number of communication rounds for a model to reach its lowest validation error in the training stage.

GRU + FMTL Using Equation A2, in each communication round each pair of nodes connected by an edge exchanges their model weights, so the total amount of communicated data scales with the number of edges, the size of the GRU model, and R.

A.3 ADDITIONAL INDUCTIVE LEARNING RESULTS

We have added results using 90% and 5% of the data on both datasets and show the table of inductive learning results as Table A6. We observe that: (1) As the portion of nodes visible in the training stage increases, the prediction error of CNFGNN decreases drastically. However, increasing the portion of visible nodes contributes negligibly to the performance of GRU + FedAvg once the portion surpasses 25%. Since increasing the ratio of nodes seen in training introduces more complex relationships among nodes into the training data, this difference in performance illustrates that CNFGNN has a stronger capability of capturing complex spatial relationships.

