GLASU: A COMMUNICATION-EFFICIENT ALGORITHM FOR FEDERATED LEARNING WITH VERTICALLY DISTRIBUTED GRAPH DATA

Anonymous authors
Paper under double-blind review

Abstract

Vertical federated learning (VFL) is a distributed learning paradigm in which computing clients collectively train a model based on the partial features that each of them possesses for the same set of samples. Current research on VFL focuses on the case where samples are independent and rarely addresses the emerging scenario in which samples are interrelated through a graph. For graph-structured data, graph neural networks (GNNs) are competitive machine learning models, but a naive implementation in the VFL setting incurs significant communication overhead. Moreover, the analysis of the training is challenging because the stochastic gradients are biased. In this paper, we propose a model splitting method that splits a backbone GNN across the clients and the server, together with a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip aggregation when evaluating the model and skip feature exchanges during training, greatly reducing communication. We offer a theoretical analysis and conduct extensive numerical experiments on real-world datasets, showing that the proposed algorithm effectively trains a GNN model whose performance matches that of the backbone GNN trained in a centralized manner.

1. INTRODUCTION

Vertical federated learning (VFL) is a newly developed machine learning scenario in distributed optimization, where clients hold data on the same set of samples but each client possesses only a subset of the features for each sample. The goal is for the clients to collaboratively learn a model based on all features. Such a scenario appears in many applications, including healthcare, finance, and recommendation systems (Chen et al., 2020b; Liu et al., 2022). For example, in healthcare, each hospital may collect partial clinical data of a patient, such that the patient's conditions and treatments are best predicted by learning from the data collectively; in finance, banks or e-commerce providers may jointly analyze a customer's credit using their trade histories and personal information; and in recommendation systems, online social/review platforms may collect a user's comments and reviews left at different websites to predict suitable products for the user.
Most current VFL solutions (Chen et al., 2020b; Liu et al., 2022) treat the case where samples are independent and ignore any relational structure among them. However, pairwise relationships between samples arise on many occasions and can be crucial in several learning scenarios, including the low-labeling-rate scenario in semi-supervised learning and the no-labeling scenario in self-supervised learning. Take the financial application as an example: customers and institutions are related through transactions. Such relations can be used to trace financial crimes such as money laundering, to assess the credit risk of a customer, and even to recommend products to them. Each bank and e-commerce provider can infer the relations among the financial individuals registered with them and create a relational graph, in addition to the individual customer information they possess.
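To make the data layout described above concrete, the following is a minimal sketch of vertically partitioned graph data. It is not part of the proposed method. The feature values, the column assignment, the edge sets, and the helper names `vertical_split` and `local_neighbors` are all hypothetical and chosen only for illustration. Every client keeps all samples but only its own feature columns, and each client records its own relations over the same sample IDs.

```python
# Toy illustration (all values are made up): 4 samples, 5 features in total.
full_features = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
    [3.1, 3.2, 3.3, 3.4, 3.5],
]


def vertical_split(features, column_groups):
    """Return one feature table per client, keeping every sample (row)
    but only the columns assigned to that client."""
    return [
        [[row[c] for c in cols] for row in features]
        for cols in column_groups
    ]


# Hypothetical assignment: client 0 owns columns 0-1, client 1 owns columns 2-4.
column_groups = [[0, 1], [2, 3, 4]]
client_features = vertical_split(full_features, column_groups)

# Each client also records its own relations among the same sample IDs
# (assumed undirected here), so the graphs may differ across clients.
client_edges = [
    {(0, 1), (1, 2)},  # relations observed by client 0
    {(1, 2), (2, 3)},  # relations observed by client 1
]


def local_neighbors(edges, node):
    """Neighbors of `node` in one client's undirected edge set."""
    outgoing = {v for u, v in edges if u == node}
    incoming = {u for u, v in edges if v == node}
    return sorted(outgoing | incoming)
```

In this toy setup, sample 1 has different neighbors at the two clients, which illustrates that no single client sees either the full feature vector or the full relational structure of a sample.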
One of the most effective machine learning models for handling relational data is the graph neural network (GNN) (Kipf & Welling, 2016; Hamilton et al., 2017; Chen et al., 2018; Velickovic et al., 2018; Chen et al., 2020a). This model performs neighborhood aggregation in every feature transformation layer, so that the prediction for a graph node is based not only on the information of this node but also on that of its neighbors. Although GNNs have been used in federated learning, a majority

