MERGING MODELS PRE-TRAINED ON DIFFERENT FEATURES WITH CONSENSUS GRAPH

Anonymous authors
Paper under double-blind review

Abstract

Learning an effective global model on private and decentralized datasets has become an increasingly important challenge for machine learning in practice. Federated Learning (FL) has recently emerged as a solution to this challenge. In particular, the FL clients agree on a common model parameterization in advance, which can then be trained collaboratively via synchronous aggregation of local model updates. However, such a strong requirement of modeling homogeneity and synchronicity across clients makes FL inapplicable to many practical scenarios. For example, in distributed sensing, heterogeneous sensors in a network sample from different data modalities of the same phenomenon, so each sensor requires its own specialized model. Local learning therefore happens in isolation, but inference still requires merging the local models to achieve consensus. To enable isolated local learning and consensus inference, we investigate a feature fusion approach that extracts local feature representations from local models and incorporates them into a global representation for holistic prediction. We study two key aspects of this feature fusion. First, we use alignment to match feature components that are arbitrarily arranged across clients. Second, we learn a consensus graph that captures the high-order interactions among data sources or modalities, revealing how data with heterogeneous features can be stitched together coherently for better prediction. The proposed framework is demonstrated on four real-life datasets, including power grids and traffic networks.

1. INTRODUCTION

To improve the scalability and practicality of machine learning applications in situations where training data are becoming increasingly decentralized and proprietary, Federated Learning (FL) (McMahan et al., 2017; Yang et al., 2019a; Li et al., 2019; Kairouz et al., 2019) has been proposed as a new model training paradigm that allows data owners to collaboratively train a common model without having to share their private data with others. The FL formalism is therefore poised to resolve the computation bottleneck of model training on a single machine and the risk of privacy violation, in light of recent policies such as the General Data Protection Regulation (Albrecht, 2016).

However, FL requires a strong form of homogeneity and synchronicity among the data owners (clients) that might not be ideal in practice. First, it requires all clients to agree in advance on a common model architecture and parameterization. Second, it requires clients to synchronously communicate their model updates to a common server, which assembles the local updates into a global learning feedback. This is rather restrictive in cases where different clients draw observations from different data modalities of the phenomenon being modeled. This leads to heterogeneous data complexities across clients, which in turn require customized forms of modeling. Otherwise, enforcing a common model with high complexity might not be affordable for clients with low compute capacity; vice versa, switching to a model with low complexity might fail to unlock important inferential insights from some data modalities. A variant of FL (Hardy et al., 2017; Hu et al., 2019; Chen et al., 2020), named vertical FL (VFL), has been proposed to address the first challenge, embracing the concept of vertically partitioned data. The term comes from the picture of cutting the data matrix vertically along the feature axis, rather than horizontally along the data (sample) axis.
Existing approaches generally maintain separate local model parameters distributed across clients and global parameters on a central server. All parameters are then learned jointly, which causes a practical drawback: coordination overhead between clients and the central server. Engineering protocols that enable multiple rounds of communication (i.e., synchronicity) and coordination effort (i.e., homogeneity) to converge on universal choices of models and training algorithms would be required, which can be expensive in practice depending on the scale of the application. To mitigate both constraints on homogeneity and synchronicity satisfactorily, we ask the following question and subsequently develop an answer to it: Can we separate global consensus prediction from local model training?

As shown later in our experiments, we address this question in the real-world context of the national electricity grid, over which thousands of phasor measurement units (PMUs) were deployed to monitor the grid condition, with data recorded in real time by each PMU (Smartgrid.gov). PMU measurements, as time series data, are owned by several parties, each of which may employ different technologies, leading to heterogeneous recordings with varying sampling frequencies and measured attributes. These data may be used to train machine learning models that identify grid events (e.g., fault, oscillation, and generator trip). Such an event detection system relies on collective time-series measurements within the same time window but distributed across owners. Using VFL to build a common model on such decentralized and heterogeneous data is plausible but not practical, because the owners are autonomous and lack the coordination that VFL requires. To resolve this challenge, we instead introduce a feature fusion perspective, which aims to minimize coordination among clients and maximize their autonomy via a local-global model framework.
Therein, each client trains a customized local model on its own data modalities. The training is independent and incurs no coordination. Once trained, the local feature representation of each client can be extracted from the penultimate layer of the corresponding local model. A central server then collects and aggregates these representations into a more holistic global representation, which is used to train a model for global inference. Two technical challenges need to be addressed to substantiate the envisioned framework.

C1. There is an ambiguity regarding the correspondence between components of local feature representations across different clients. This ambiguity arises because the local models are trained separately in isolation, and there is no mechanism to enforce that their induced feature dimensions are aligned. In fact, it is possible to permute the induced feature dimensions without changing the prediction outcome; thus, two separately trained models might end up looking at the same feature space but with permuted dimensions.

C2. There are innate local interactions among subsets of clients that need to be accounted for. Naively concatenating or averaging the local feature representations accounts for the global interaction but ignores such local interactions, which, as shown later in our experiments, are important for boosting the accuracy of the global prediction.

To address C1, note that the feature dimension alignment problem is discrete in nature; furthermore, there is no direct feedback to optimize for such an alignment. To sidestep this challenge, we develop a neuralized alignment layer whose parameters are differentiable and can therefore be part of a larger network, including the feature aggregation and prediction layers, which can be trained end-to-end via gradient back-propagation (Section 4).
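One common way to neuralize a discrete permutation, which we sketch here purely for illustration (the paper's own layer may differ), is to relax the hard permutation matrix into a doubly stochastic matrix via Sinkhorn normalization of a learnable score matrix; the function names, iteration count, and dimensions below are our own assumptions.

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Relax a hard permutation into a doubly stochastic matrix by
    alternately normalizing the rows and columns of exp(scores).
    The result is differentiable w.r.t. the (learnable) scores."""
    P = np.exp(scores - scores.max())  # positive matrix, numerically stable
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # row normalization
        P /= P.sum(axis=0, keepdims=True)  # column normalization
    return P

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))   # learnable alignment scores for 4 feature dims
P = sinkhorn(scores)               # soft permutation (doubly stochastic)
z_local = rng.normal(size=4)       # one client's local feature vector
z_aligned = P.T @ z_local          # softly re-ordered feature components
```

Because the relaxation is smooth, gradients from the downstream aggregation and prediction layers can flow back into the alignment scores during end-to-end training.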
To address C2, we employ graph neural networks as the global inference model, where the graph corresponds to the explicit or implicit relational structure of the data owners. As such a graph might not be given in advance, we treat the combinatorial graph structure as a random variable drawn from a product of Bernoulli distributions, whose (differentiable) parameters can likewise be optimized via a gradient-based approach (Section 5). The technical contributions of this work are summarized below.
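A product of independent edge-wise Bernoulli distributions can be sampled in a differentiable way with, for example, the binary Concrete (Gumbel-sigmoid) relaxation; the following sketch illustrates that generic trick, with function names, the temperature value, and the number of clients being our own assumptions rather than details from the paper.

```python
import numpy as np

def sample_soft_adjacency(logits, temperature=0.5, rng=None):
    """Draw a relaxed sample of a random graph whose edges are independent
    Bernoulli variables with probabilities sigmoid(logits), using the
    binary Concrete (Gumbel-sigmoid) trick so the sampled adjacency stays
    differentiable w.r.t. the edge logits."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    logistic_noise = np.log(u) - np.log(1 - u)
    soft_edges = 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))
    soft_edges = np.triu(soft_edges, k=1)   # undirected graph, no self-loops
    return soft_edges + soft_edges.T

logits = np.zeros((5, 5))  # learnable edge logits for a 5-client consensus graph
A = sample_soft_adjacency(logits, rng=np.random.default_rng(0))
```

At a low temperature the soft edge weights concentrate near 0 or 1, approximating hard Bernoulli draws while still letting gradient feedback from the global prediction loss update the edge logits.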

We formalize a feature fusion perspective for distributed learning in settings where data is vertically partitioned. This is an alternative view to VFL but, as elaborated above, is more applicable when iterative training synchronicity¹ is not possible among clients (Section 2).

¹Note that in our case, synchronicity requires co-training among clients, which is a weaker constraint than its usual meaning of further requiring clients to synchronize their updates per iteration.

