HYBRID FEDERATED LEARNING FOR FEATURE & SAMPLE HETEROGENEITY: ALGORITHMS AND IMPLEMENTATION

Abstract

Federated learning (FL) is a popular distributed machine learning paradigm for learning from distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. All three settings have many applications, but hybrid FL remains relatively unexplored, because it deals with the challenging situation where both the feature space and the data samples are heterogeneous. This work designs a novel mathematical model that effectively allows the clients to aggregate distributed data with heterogeneous, and possibly overlapping, features and samples. Our main idea is to partition each client's model into a feature extractor part and a classifier part, where the former is used to process the input data, while the latter is used to perform the learning from the extracted features. The heterogeneous feature aggregation is done through building a server model, which assimilates local classifiers and feature extractors through a carefully designed matching mechanism. A communication-efficient algorithm is then designed to train both the client and server models. Finally, we conduct numerical experiments on multiple image classification data sets to validate the performance of the proposed algorithm. To our knowledge, this is the first formulation and algorithm developed for hybrid FL.

1. INTRODUCTION

Federated Learning (FL) is an emerging distributed machine learning (ML) framework which enables heterogeneous clients, such as organizations or mobile devices, to collaboratively train ML models (Konečnỳ et al., 2016; Yang et al., 2019). The development of FL aims to address practical challenges in distributed learning, such as feature and data heterogeneity, high communication cost, and data privacy requirements. The challenge due to heterogeneous data is particularly evident in FL. The most well-known form of heterogeneous data is sample heterogeneity (SH), where the distributions of training samples differ across the clients (Kairouz et al., 2021; Bonawitz et al., 2019). Severe SH can cause common FL algorithms such as FedAvg to diverge (Khaled et al., 2019; Karimireddy et al., 2020b). Recently, better-performing algorithms and system architectures for distributed ML (including FL) under SH have been proposed in Karimireddy et al. (2020b); Li et al. (2018); Wang et al. (2020); Fallah et al. (2020); Vahidian et al. (2021). Besides SH, another form of heterogeneity is feature heterogeneity (FH). Traditionally, we say the samples exhibit FH if we can partition them into subsets that bear distinct features. In the FL setting, we say FH arises when the sample subsets of different clients have different, but not necessarily distinct, features. That is, under FH, different clients have unique and possibly also common features. FH and SH arise in ML tasks such as collaborative medical diagnosis (Ng et al., 2021), recommendation systems (Yang et al., 2020), and graph learning (Zhang et al., 2021), where the data collected by different clients have different, and possibly overlapping, features and sample IDs. Next, we provide a few examples.

Medical diagnosis application (see Figure 1). The clients are clinics, and they collect data samples from patients. Each clinic may have a different set of diagnostic devices, e.g., clinic A has MRI and ultrasound, while clinic B has MRI and electrocardiographs (ECG). FH arises as the feature set of each sample collected by clinic A may only partially overlap with that collected by clinic B. Besides FH, SH also arises, as multiple clinics may not have the chance of treating the same patient and each patient usually visits only a subset of clinics. Recommendation system (Yang et al., 2020; Zhan et al., 2010). In this case, the clients are large retailers, and they collect samples (such as shopping records) from their customers. The retailers share a subset of common products and a subset of common customers. A third example pertains to learning over multiple social networks (Zhang et al., 2021; Guo & Wang, 2020). Here the clients are social network providers (e.g., Twitter, Facebook), and the samples are the set of participating users, their activities, and relations. We summarize these three examples in Table 1.

In the previous three applications, client data can be heterogeneous in both feature and sample. Surprisingly, none of the existing FL algorithms can fully handle such data. Rather, Horizontal FL (HFL) and Vertical FL (VFL) methods can each handle data with only one kind of heterogeneity, the former with SH and the latter with FH. By keeping only the common features (and ignoring the other features), we can avoid FH and apply an HFL method. By keeping only the common samples (and discarding the remaining samples), we can avoid SH and apply a VFL method. Clearly, both approaches waste data. Consider the HFL algorithms (Konečnỳ et al., 2016; Karimireddy et al., 2020b;a; Dinh et al., 2021). The clients perform multiple local model updates, and the server averages those updates and broadcasts the new model to the clients. This scheme works when the clients share the same model and their data share an identical set of features (see Figure 2b for an illustration); otherwise, the server cannot average their models. Consider the VFL algorithms (Liu et al., 2019; Chen et al., 2020). They split the model into blocks. Each client processes a subset of the blocks while the server aggregates the processed features to compute training losses and gradients. They require all the clients to have the same set of samples (see Figure 2c); otherwise, they cannot compute the loss and its gradient. According to Yang et al. (2019); Rahman et al. (2021), the FL setting with heterogeneous features and samples is referred to as hybrid FL.

To develop a hybrid FL method, we must address the following challenges: 1. Global and local inference requires global and local models. Hybrid FL makes it possible for a client to make its local inference and also for all the clients (or the server) to make a global inference. The former requires only the features local to a client; the latter requires all the features and training a global model at the server. 2. Limited data sharing. In typical HFL, the clients do not share their local data or labels during training. In VFL, the labels are either made available by the clients to the server (Chen et al., 2020) or stored in a designated client (Liu et al., 2019). A hybrid FL system may be subject to a "no sharing" requirement, so it is desirable to develop a method in which the server has no access to any data, including the labels. 3. Sample synchronization. A technical challenge with VFL is that the server needs the clients to draw the same mini-batch of samples at each iteration. This challenge is exacerbated in hybrid FL since not all the clients have the same samples. Therefore, to avoid idling clients, a hybrid FL method should allow uncoordinated sample draws.

Our contributions: Towards addressing the previous challenges, this work proposes a novel model and its training method. We summarize our contributions as follows.

1. We propose a new hybrid FL approach. For each client, the model consists of a feature extractor and a subsequent classifier. The clients collaborate and share their knowledge through building a model at the server that assimilates local classifiers and feature extractors. The assimilation is achieved by a matching mechanism inspired by the non-parametric modeling idea in Yurochkin et al. (2019). This approach enables both global and local inferences and can handle data with both SH and FH. To our knowledge, this is the first concrete hybrid FL model in the literature. 2. We develop a hybrid FL algorithm that enables knowledge transfer among the clients. The algorithm maintains data locality, so the server does not access clients' data, and it allows uncoordinated sample draws by the clients. 3. We evaluate the performance of the hybrid FL algorithm on a number of real datasets. The learned model achieves an accuracy that is comparable to that of a centrally trained model.

1.1. RELATED WORK

Federated graph learning (FGL) has been applied to molecular classification (He et al., 2021), relation or node classification for social networks (Zhang et al., 2021; Ng et al., 2021), and financial networks (Suzumura et al., 2019). In the first application, the graphs are relatively small and the clients hold a large number of graphs (Zhang et al., 2021; He et al., 2021). In the last two application scenarios, the clients possess partial yet overlapping data of a single large graph, including partial node and edge information (Zhang et al., 2021). However, existing FGL algorithms mainly focus on the first application scenario (He et al., 2021) and fail to deal with the latter two scenarios, so we cannot apply them to our hybrid FL setting. The most popular HFL algorithm is FedAvg (Konečnỳ et al., 2016), which adopts a computation-then-aggregation strategy: the clients locally perform a few steps of model updates, and then the server aggregates the updated local models and averages them before sending the updated global model back to the clients. Beyond model averaging, PFNM (Yurochkin et al., 2019) and FedMA (Wang et al., 2020) use a parameter-matching-based strategy, and FedGKT (He et al., 2020) uses a knowledge distillation strategy, to improve global model performance; these methods do not require the global model to have the same size as the local models. All HFL algorithms assume the client data share the same set of features. Personalized FL (PFL) has been studied as a potential way to tackle different levels of task heterogeneity. MAML-type methods (Jiang et al., 2019; Fallah et al., 2020) use meta-learning to build a global model that can quickly adapt to heterogeneous data distributions; FedProx (Li et al., 2018) and LG-FedAvg (Hanzely & Richtárik, 2020) regularize the distance between the local models and the global model. MOCHA (Smith et al., 2017) and FedU (Dinh et al., 2021) combine multi-task learning with FL to train models for personalized tasks. FedPer (Arivazhagan et al., 2019) separates the model into base and personalized layers to decouple common and personal knowledge. However, most of these algorithms assume that all local models take the same input size and format.

2. PROBLEM FORMULATION

In this section, we first provide a mathematical characterization of the heterogeneous data distributions of interest to this work. We then propose a unified hybrid FL model. 

Notation

For positive integers d_1 ≤ d_2, define the set of selection matrices

S(d_1, d_2) := { P ∈ {0,1}^{d_1×d_2} : Σ_{j=1}^{d_2} P_{ij} = 1, ∀ i ∈ [d_1]; Σ_{i=1}^{d_1} P_{ij} ∈ {0,1}, ∀ j ∈ [d_2] }.

That is, each row of such a P selects exactly one coordinate, and each coordinate is selected at most once.

Data description: See Figure 3a for an illustration of a dataset with three clients, where the client datasets have no fully overlapping samples or features, so neither HFL nor VFL can be used. We consider a hybrid FL system with M clients indexed by m ∈ [M], which collaborate to accomplish the same task. For convenience, we index the server as m = 0. First, assume that each sample can have at most d_0 feature blocks, where the i-th block takes values in the set D_i, i ∈ [d_0]; client m has a set of d_m feature blocks indexed by I_m, that is, it holds the blocks ⟨D_{i_m}⟩_{i_m ∈ I_m}, and its feature space is the Cartesian product X_m = ∏_{i_m ∈ I_m} D_{i_m}. Similarly, we denote the "full feature" space as X_0 = ∏_{i=1}^{d_0} D_i, the Cartesian product of all feature blocks. Second, client m holds a private dataset with index set N_m and samples (x_{m,n}, y_n) for n ∈ N_m, where x_{m,n} ∈ X_m denotes the features of the n-th sample on client m, and y_n denotes the label of the n-th sample. Collecting all the clients' data together, we can define the (virtual) global dataset to have sample index set N_0 = [N], with samples (x_{0,n}, y_n), where x_{0,n} ∈ X_0 denotes the "full feature" of the n-th sample (for the precise relation between the full-featured x_{0,n} and the local sample x_{m,n}, see property P2 below). The dataset defined above satisfies the following properties.

P1)

The global index set is the union of the clients' index sets: N_0 = ∪_{m=1}^M N_m, which implies N_m ⊆ N_0.

P2) For a given client m, the features of the n-th sample form a sub-vector of the "full features". That is, there exists a selector matrix P_m that maps the global feature x_{0,n} to x_{m,n}:

x_{m,n} = P_m x_{0,n}, for some P_m ∈ S(d_m, d_0),  (1)

where P_m is a selector matrix that selects the feature blocks on client m from the full feature.

Model structure: On client m, we stack the feature extractors and their parameters as

H_m(Θ_m; •) := [h_{i_m}(θ_{m,i_m}; •)]_{i_m ∈ I_m}, and Θ_m := [θ_{m,i_m}]_{i_m ∈ I_m}.  (2)

The classifier F_m(w_m; •) is parameterized by w_m, and we denote the prediction loss function as ℓ(•, •). The data processing procedure on client m is as follows: (1) the features x_{m,n} of the n-th sample are passed to the feature extractors {h_{i_m}(θ_{m,i_m}; •)}_{i_m ∈ I_m}; (2) the classifier F_m(w_m; •) makes the prediction based on the concatenated output of the feature extractors, H_m(Θ_m; x_{m,n}); (3) the prediction F_m(w_m; H_m(Θ_m; x_{m,n})) and the true label y_n together evaluate the loss ℓ(•, •). With this data processing procedure, the prediction loss on client m is defined as:

f_m(Θ_m, w_m) := (1/|N_m|) Σ_{n ∈ N_m} ℓ(F_m(w_m; H_m(Θ_m; x_{m,n})), y_n).  (3)

Additionally, the server has a model with the full set of feature extractors, concatenated with a classifier; see the top figure in Fig. 3b. This model structure covers a wide range of ML models for classification and regression problems, e.g., image classification, language processing, and recommendation systems.

Model matching: To enable effective collaboration among the clients, our idea is to properly match different parts of the models by imposing a number of carefully designed regularizers. First, it is natural to assume that when clients m and m′ share the same feature block D_i, the corresponding feature extractors h_i(θ_{m,i}; •) and h_i(θ_{m′,i}; •) should produce the same output, that is, θ_{m,i} ≈ θ_{0,i} ≈ θ_{m′,i}.
Therefore, we impose the following regularizer for the feature extractors, which matches the i-th feature extractor at client m with the corresponding extractor at the server:

r_{m,1}(Θ_m, Θ_0) := Σ_{i ∈ I_m} (1/2) ∥θ_{m,i} − θ_{0,i}∥² = (1/2) ∥Θ_m − P_m Θ_0∥²,  (4)

where P_m is the data selection matrix defined in (1) and Θ_m concatenates the parameters defined in (2). We then design the regularizer for the classifiers. As the classifiers on different clients share partially overlapping input spaces and identical output spaces, we model the clients' classifiers w_m as "pruned" versions of the server-side classifier w_0, but with unknown pruning patterns. More specifically, assuming w_m ∈ R^{d_{m,w}} and w_0 ∈ R^{d_{0,w}}, we impose the following regularizer for the classifier:

r_{m,2}(w_m, Π_m, w_0) := (1/2) ∥w_m − Π_m w_0∥², s.t. Π_m ∈ S(d_{m,w}, d_{0,w}),  (5)

where Π_m is a selection matrix defining the unknown pruning pattern. It is important to note that the pruning matrices Π_m are unknown and need to be optimized. In contrast, in the definition of the feature extractor regularizer (4), the data selection matrices P_m are fixed, as they are defined by the data partitioning pattern. A detailed discussion of the structure of the constraints on the pruning matrices Π_m and the regularizer r_{m,2} is given in Appendix A.1.

Overall problem formulation: By combining the models discussed in the previous two subsections, we arrive at the following training problem:

min_{{Θ_m, w_m}_{m=0}^M, {Π_m}_{m=1}^M}  Σ_{m=1}^M p_m ( f_m(Θ_m, w_m) + μ_1 · r_{m,1}(Θ_m, Θ_0) + μ_2 · r_{m,2}(w_m, Π_m, w_0) ),
s.t. Π_m ∈ S(d_{m,w}, d_{0,w}), ∀ m ∈ [M],  (6)

where μ_1, μ_2 are hyper-parameters for the regularizers, and the p_m's are the weights for each local problem, satisfying Σ_{m=1}^M p_m = 1, with common choices p_m = 1/M or p_m = |N_m|/|N|.

Remark 3. (Relation with HFL). When d_0 = 1, that is, there is only a single feature block across all the clients, the data structure can be handled by conventional HFL.
Below we discuss the relations between our model (6) and some popular HFL models. First note that when d_0 = 1, the feature extractor regularizer (4) reduces to

r_{m,1}(Θ_m, Θ_0) = (1/2) ∥Θ_m − Θ_0∥².  (7)

1) Reduction to FedMA (Wang et al., 2020) and Sub-FedAvg (Vahidian et al., 2021). If we set Θ_m = I, i.e., the features are directly processed by the w_m's, then (6) is equivalent to the problem solved by FedMA and Sub-FedAvg. 2) Reduction to FedProx (Li et al., 2018) and LG-FedAvg (Hanzely & Richtárik, 2020). By setting w_m = I and p_m = 1/M, and letting Θ_m directly predict the labels, the problem reduces to

min_{{Θ_m}_{m=0}^M} (1/M) Σ_{m=1}^M ( f_m(Θ_m) + μ_1 · r_{m,1}(Θ_m, Θ_0) ),

which is equivalent to the formulation solved by FedProx and LG-FedAvg. 3) Reduction to FedAvg. Further letting μ_1 → ∞ in (7), the regularizer enforces exact consensus among the Θ_m's, and the problem reduces to the one solved by FedAvg. 4) Reduction to FedPer (Arivazhagan et al., 2019). By letting μ_2 = 0 and μ_1 → ∞ in (6), the regularizer on the w_m's is removed and the Θ_m's achieve exact consensus. In this case, Θ_m serves as the base layers while the w_m's serve as the personalized layers, which is equivalent to the model design of FedPer. □ Remark 4. (Relation with VFL). VFL assumes that the clients cannot perform prediction independently, so it directly trains a global model with the local data (Liu et al., 2019; Chen et al., 2020). In contrast, we assume that each client has sufficient features for independent training and constructs a local model, which is further used to construct a global model. This way, we avoid the data sharing and sample synchronization issues that often limit VFL use in practice. □
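To make the formulation concrete, the following is a minimal numerical sketch of one client's term in objective (6), combining the prediction loss (3) with the two matching regularizers (4) and (5). All shapes, the scalar "extractors", the linear "classifier", and the values of μ_1, μ_2 are illustrative assumptions, not the paper's architectures.

```python
import numpy as np

# Toy instance of client m's term in objective (6):
#   f_m + mu1 * r_{m,1} + mu2 * r_{m,2}   (eqs. (3), (4), (5)).
rng = np.random.default_rng(0)

d0, Im = 5, [0, 2, 3]                 # full feature dim; client m's feature blocks
Pm = np.eye(d0)[Im]                   # P_m ∈ S(3, 5), fixed by the data split
Pim = Pm.copy()                       # Π_m ∈ S(3, 5): one candidate pruning pattern

Theta0, w0 = rng.normal(size=d0), rng.normal(size=d0)  # server parameters
Thetam, wm = rng.normal(size=3), rng.normal(size=3)    # client parameters

X = rng.normal(size=(8, 3))           # 8 local samples, features in client's blocks
ylab = rng.normal(size=8)             # labels (regression-style for simplicity)

def f_m(Theta, w):
    """Eq. (3) with scalar extractors h_i(theta; x_i) = theta * x_i
    and a linear classifier F_m(w; .) under squared loss."""
    H = X * Theta                     # feature-extractor outputs H_m(Θ_m; x)
    pred = H @ w
    return 0.5 * np.mean((pred - ylab) ** 2)

mu1, mu2 = 0.1, 0.1
r1 = 0.5 * np.sum((Thetam - Pm @ Theta0) ** 2)   # extractor matching, eq. (4)
r2 = 0.5 * np.sum((wm - Pim @ w0) ** 2)          # classifier matching, eq. (5)
obj = f_m(Thetam, wm) + mu1 * r1 + mu2 * r2      # client m's term in (6)
assert np.isfinite(obj) and obj >= 0.0
```

Note that because P_m is a selection matrix, P_m Θ_0 simply gathers the server parameters for the blocks in I_m, which is why r_{m,1} decomposes block-wise as in (4).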

3. ALGORITHM DESIGN

In this section, we propose a training algorithm for the proposed hybrid FL formulation (6). The algorithm alternates between server-side updates and client-side updates. To proceed, we first split (6) into a server-side problem and a client-side problem, and then develop algorithms to optimize each part. One key consideration in our algorithm design is to ensure that the server-side model is optimized without directly accessing any client's data.

Problem splitting: Notice that the problem contains the parameter blocks {Θ_m}_{m=1}^M, {w_m}_{m=1}^M, Θ_0, w_0, and {Π_m}_{m=1}^M. First we divide the parameters into two groups: 1) the server-side parameters Θ_0, w_0, and {Π_m}_{m=1}^M, and 2) the client-side parameters {Θ_m}_{m=1}^M and {w_m}_{m=1}^M. By fixing the server-side parameters, (6) decomposes into M independent problems, one for each client. The problem related to client m is given by:

min_{Θ_m, w_m} f_m(Θ_m, w_m) + μ_1 · r_{m,1}(Θ_m, Θ_0) + μ_2 · r_{m,2}(w_m, Π_m, w_0).  (8)

Similarly, by fixing the client-side parameters, the f_m's in (6) become constants, and the problem reduces to the following server-side problem:

min_{Θ_0, w_0, {Π_m}_{m=1}^M} Σ_{m=1}^M p_m ( μ_1 · r_{m,1}(Θ_m, Θ_0) + μ_2 · r_{m,2}(w_m, Π_m, w_0) ),
s.t. Π_m ∈ S(d_{m,w}, d_{0,w}), ∀ m ∈ [M].  (9)

The above problem naturally separates into two sub-problems. The first sub-problem is:

min_{Θ_0} Σ_{m=1}^M p_m · r_{m,1}(Θ_m, Θ_0),  (10)

and the second one is:

min_{w_0, {Π_m}_{m=1}^M} Σ_{m=1}^M p_m · r_{m,2}(w_m, Π_m, w_0), s.t. Π_m ∈ S(d_{m,w}, d_{0,w}), ∀ m ∈ [M].  (11)

Algorithm 1 Hybrid Federated Matched Averaging (HyFEM)
1: Input: w_0^0, Θ_0^0, {Π_m^0}_{m=1}^M, η, T, Q, P
2: for t = 0, ..., T − 1 do
3:   for client m = 1, ..., M in parallel do
4:     Θ_m^{t,Q}, w_m^{t,Q} ← ClientUpdate(Θ_0^t, Π_m^t, w_0^t, Q, η)  // Local perturbed SGD solving (8)
5:     Send client model Θ_m^{t,Q}, w_m^{t,Q} to the server
6:   for the server do
7:     Θ_0^{t+1} ← ( Σ_{m=1}^M p_m P_m^T P_m )^{−1} Σ_{m=1}^M p_m P_m^T Θ_m^{t,Q}  // Exact minimization for (10)
8:     w_0^{t+1}, {Π_m^{t+1}}_{m=1}^M ← ModelMatching({w_m^{t,Q}, Π_m^t}_{m=1}^M, P)  // Solving (11)
9:     Distribute the server model w_0^{t+1}, Θ_0^{t+1}, {Π_m^{t+1}}_{m=1}^M to the clients
10: Output: {w_m^T, Θ_m^T}_{m=0}^M, {Π_m^T}_{m=1}^M

Algorithm design: We propose a block coordinate descent (BCD) type algorithm called Hybrid Federated Matched Averaging (HyFEM), given in Algorithm 1, to solve (6) with the above problem splitting strategy; its sub-routines are given in Algorithm 2 in Appendix A.2. In global iteration t, the clients first perform Q local perturbed SGD steps on problem (8) to optimize the client models w_m^t, Θ_m^t (lines 1-7 in Algorithm 2); then the server aggregates the updated client models, updates the global feature extractors by optimizing (10), which has the closed-form solution in line 7 of Algorithm 1, and matches the classifiers by optimizing (11); finally, the server distributes the models and the selection matrices to the clients. The major step in the algorithm is solving the sub-problem (11). We optimize it by the ModelMatching procedure described in lines 8-14 of Algorithm 2 in Appendix A.2: 1) for each client index m′, construct the server model w_0^{t,p} without the impact of the selected client; 2) apply the Hungarian algorithm to solve a parameter assignment problem and obtain Π_{m′}^{t,p+1} in at most O((d_{m,w})^3) run-time complexity (Kuhn, 1955). With a few rounds of such updates, we obtain the server classifier and the selection matrices for each client.
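The closed-form server update in line 7 of Algorithm 1 can be sketched numerically. Since each P_m is a selection matrix, Σ_m p_m P_m^T P_m is diagonal, so the update is a per-feature-block weighted average over the clients that own each block. The dimensions, ownership patterns, and parameter values below are illustrative.

```python
import numpy as np

# Sketch of line 7 of Algorithm 1:
#   Θ_0 ← (Σ_m p_m P_m^T P_m)^{-1} Σ_m p_m P_m^T Θ_m.
d0 = 4
P1 = np.eye(d0)[[0, 1]]          # client 1 owns feature blocks {0, 1}
P2 = np.eye(d0)[[1, 2, 3]]       # client 2 owns feature blocks {1, 2, 3}
Theta1 = np.array([1.0, 2.0])    # client 1's (scalar-per-block) extractor params
Theta2 = np.array([4.0, 6.0, 8.0])
p1 = p2 = 0.5

A = p1 * P1.T @ P1 + p2 * P2.T @ P2        # diagonal ownership-weight matrix
b = p1 * P1.T @ Theta1 + p2 * P2.T @ Theta2
Theta0 = np.linalg.solve(A, b)             # valid: every block owned by >= 1 client

# Block 0 is held only by client 1 (server copies it); block 1 is shared
# (server averages 2.0 and 4.0); blocks 2, 3 come only from client 2.
assert np.allclose(Theta0, [1.0, 3.0, 6.0, 8.0])
```

If some feature block were owned by no client, A would be singular; the formulation avoids this because every block D_i belongs to at least one I_m.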
This procedure is inspired by the model matching algorithms of Wang et al. (2020); Yurochkin et al. (2019) for matching parameters in deep neural networks of the same size. Our matching algorithm is a non-trivial extension of these existing algorithms: because the server-side and client-side models do not share the exact same functionality, we cannot simply replace the client-side models with the server-side model. This special property introduces significant challenges for model matching. The detailed matching procedure is included in Appendix A.2.

Remark 5. Although Algorithm 1 may seem complicated, it can be viewed as optimizing a problem with three parameter blocks, L(x, y, z), where x is the collection {w_m, Θ_m}_{m=1}^M, y is the collection {Π_m}_{m=1}^M, and z is {w_0, Θ_0}. Then one round of updates can be written as:

x⁺ ← x − η ∇̃_x L(x, y, z)  (repeated Q times),
y⁺ ← arg min_{y ∈ Range(y)} L(x⁺, y, z),
z⁺ ← arg min_z L(x⁺, y⁺, z),  (12)

where ∇̃_x L(•) denotes the stochastic partial gradient estimate with respect to x. □

Our convergence guarantee (Theorem 1, stated formally in Appendix B) shows that Algorithm 1 converges with rate O(1/T). Alternatively, if we assume the solution {Π_m}_{m=1}^M to sub-problem (11) is unique, and we update the client models with one-step gradient descent, then Algorithm 1 asymptotically converges to a first-order stationary point of (6).

Remark 6. Theorem 1 is a non-trivial extension of the convergence results for traditional BCD-type algorithms. The major challenges in the analysis of HyFEM are: 1) it runs a multiple, yet fixed, number of stochastic gradient updates on the (potentially nonconvex) blocks {Θ_m, w_m}_{m=1}^M, which results in a non-strict decrease of the objective; 2) the problem with respect to the block {Π_m}_{m=1}^M is nonconvex and non-smooth and does not have a unique global minimum. Such a setting is different from existing work on BCD-type algorithms. The detailed convergence statement and its proofs are given in Appendix B. □
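The core of the ModelMatching step is an assignment problem: each client classifier row must be matched to a distinct server classifier row so as to minimize the squared distance, as in regularizer (5). The paper solves this with the Hungarian algorithm; the sketch below uses a dependency-free brute-force search over injective assignments instead, which is equivalent for the tiny illustrative sizes used here (all weights are made up).

```python
from itertools import permutations

def match_rows(w_m, w_0):
    """Find the injective assignment a minimizing
    sum_i ||w_m[i] - w_0[a(i)]||^2, i.e., the pattern encoded by Π_{m,2}.
    Brute force over permutations; the Hungarian algorithm does the same
    job in O(d^3) time for realistic sizes."""
    dm, d0 = len(w_m), len(w_0)
    def cost(assign):
        return sum(sum((a - b) ** 2 for a, b in zip(w_m[i], w_0[j]))
                   for i, j in enumerate(assign))
    return min(permutations(range(d0), dm), key=cost)

w_0 = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # server classifier rows
w_m = [[4.9, 5.1], [0.1, -0.1]]              # client rows (permuted, noisy copies)
assert match_rows(w_m, w_0) == (2, 0)        # client row 0 ↔ server row 2, etc.
```

The returned tuple is exactly the nonzero pattern of a selection matrix in S(d_{m,w}, d_{0,w}): row i of Π carries a 1 in the matched column.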
We highlight the merits of the proposed approach: 1) Unlike typical VFL formulations (Liu et al., 2019; Chen et al., 2020), our approach keeps the data at the clients. Hence, the local problems are fully separable, and no sample-drawing synchronization is needed during local updates. 2) By utilizing the proposed model matching technique, we can generate a global model at the server that makes use of the full features. This makes the inference stage flexible: a client can use either partial features (via its local parameters (Θ_m, w_m)) or the full features, by requesting (Θ_0, w_0) from the server or letting the server do the inference. Although we formulate the problem by adopting the idea of model splitting from VFL and model pruning/matching from HFL, optimizing (5) is still a non-trivial procedure. Specifically, we can only train clients' classifiers w_m of different sizes, and we must construct the unknown server classifier w_0 from the w_m's while finding the Π_m's, whereas existing algorithms either require w_0 to be given (Vahidian et al., 2021) or require the w_m's to have the same size (Wang et al., 2020; Yurochkin et al., 2019).
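The three-block view of Remark 5 (update scheme (12)) can be exercised on a toy objective. Here L is quadratic in x and z, y ranges over a finite set of selection patterns, and all choices (dimensions, step size, candidate set) are illustrative; each round performs Q gradient steps on x followed by exact minimization over y and z, so the objective is monotonically non-increasing.

```python
import numpy as np

# Toy instance of the BCD scheme (12): x-block by Q gradient steps,
# y-block by exact search over a finite "Range(y)", z-block in closed form.
rng = np.random.default_rng(1)
x, z = rng.normal(size=3), rng.normal(size=3)
Y = [np.eye(3), np.eye(3)[[1, 0, 2]]]        # two candidate selection patterns
y = Y[0]
eta, Q = 0.1, 5

def L(x, y, z):
    return 0.5 * np.sum((x - y @ z) ** 2) + 0.5 * np.sum(x ** 2)

prev = L(x, y, z)
for t in range(20):
    for _ in range(Q):                       # x+ ← x − η ∇_x L, Q times
        x = x - eta * ((x - y @ z) + x)
    y = min(Y, key=lambda Yc: L(x, Yc, z))   # y+ ← argmin over the finite set
    z = y.T @ x                              # z+ ← argmin_z (y is a permutation)
    cur = L(x, y, z)
    assert cur <= prev + 1e-9                # per-round monotone decrease
    prev = cur
```

After each z-step the first term of L vanishes (y is orthogonal, so z = yᵀx zeroes the residual), which mirrors how the server's exact sub-problem solves line 7 of Algorithm 1.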

4. NUMERICAL EXPERIMENTS

To evaluate the proposed algorithm, we conduct experiments on a number of standard datasets and compare the results with several baselines, including centralized training and stand-alone local training (without any client-server communication). Since existing FL algorithms cannot be applied to our setting, where the client features only partially overlap, we do not compare HyFEM with other FL algorithms in this section. However, we include an additional set of experiments in Appendix C comparing HyFEM and FedProx under less heterogeneous features. Dataset & data splitting: We consider the ModelNet40, Cifar-10, and EuroSAT datasets, detailed below. We also consider an additional multi-modal dataset; we refer the readers to Appendix C.3 for details. EuroSAT (Helber et al., 2019): EuroSAT is a land cover classification satellite image dataset with N_0 = 27,000 samples from 10 classes, and the images are split into 12 feature blocks in the same way as Cifar-10. In the training phase of each task, we manually assign a few feature blocks and classes to each client, so that the clients have partially overlapping features and samples and exhibit FH and SH. The settings are summarized in Table 2. It is worth pointing out that in setting ModelNet40:2, 12.08% of the data are never used by any of the clients during training, and in all settings, there is no feature or sample shared by all clients, so VFL and HFL algorithms cannot be applied. We conduct two sets of experiments on the ModelNet40 dataset, where setting 1 uses d_0 = 4 views and setting 2 uses the full d_0 = 12 views. The first setting has fewer features, so the classifiers are smaller and the matching procedure is easier and expected to be more accurate; thus the performance of the server model should be closer to that of the model obtained with centralized training.
In the second setting, the matching procedure is more complex than in the first and should result in worse server model performance. An illustration of the data assignment pattern is given in Appendix C.2. In the testing phase, the clients evaluate their models on all testing samples, using the feature blocks used in the training phase. We average the accuracies obtained by the clients to obtain the averaged local accuracy. The global accuracy is evaluated using the matched server model on all testing samples with full features. Training settings: In the experiments, we use an MLP with one hidden layer as the classifier F_m(w_m; •). We use the CNN part of ResNet-18 followed by one pooling layer as the feature extractors for the Cifar-10 and EuroSAT datasets, and the CNN part of ResNet-34 followed by one pooling layer as the feature extractors for the ModelNet40 dataset. Numerical results: The global accuracy under different settings is shown in Figure 4. We can see that HyFEM can train a server model with higher accuracy than stand-alone training in all settings. Moreover, the server models achieve performance comparable to models obtained with centralized training, even though none of the clients has the full features or full classes of the data; that is, HyFEM can deal with data exhibiting both SH and FH. As expected, in setting ModelNet40:2, the server model accuracy is lower than in setting ModelNet40:1, because the matching problem is harder for larger classifiers and 12.08% of the data are never used by any of the clients, in contrast with centralized training. The average client accuracy for different settings is shown in Figure 5. Client models have lower testing accuracies than server models. This is reasonable, as the client models are trained with partial features and biased data with partial classes.
We also observe that the stand-alone accuracy under setting ModelNet40:1 is higher than under ModelNet40:2, as each client has more samples. However, the accuracy improvement from HyFEM under setting ModelNet40:1 is smaller than under setting ModelNet40:2, as the latter uses more features. Nevertheless, HyFEM can train much better client models than stand-alone training, even though the clients do not share the same input space and classes. Through the global model matching algorithm, the local classifiers can share knowledge with other clients on unseen classes and thus deal with SH.
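The FH+SH data assignment described above can be sketched programmatically. The splitter below gives each client a subset of feature blocks (producing FH) and a subset of classes (producing SH); the block counts, class counts, and seed are illustrative, not the exact settings of Table 2.

```python
import random

# Illustrative hybrid-FL splitter: clients get partially overlapping
# feature blocks AND partially overlapping sample sets.
random.seed(0)
d0, num_classes, M = 12, 10, 4       # feature blocks, classes, clients
blocks_per_client, classes_per_client = 6, 5

client_blocks = [sorted(random.sample(range(d0), blocks_per_client))
                 for _ in range(M)]
client_classes = [sorted(random.sample(range(num_classes), classes_per_client))
                  for _ in range(M)]

# A client keeps sample n iff the sample's class is assigned to it;
# it only sees the feature blocks it owns (FH on top of SH).
samples = [(n, n % num_classes) for n in range(100)]        # (id, class)
client_ids = [{n for n, c in samples if c in client_classes[m]}
              for m in range(M)]

shared_blocks = set.intersection(*map(set, client_blocks))
shared_ids = set.intersection(*client_ids)
print("blocks shared by all clients:", sorted(shared_blocks))
print("samples shared by all clients:", len(shared_ids))
```

When the shared sets come out empty, neither pure-HFL (which needs common features) nor pure-VFL (which needs common samples) is applicable, which is exactly the regime the experiments target.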

5. CONCLUSIONS

We propose a hybrid FL framework that handles a general collaborative-learning scenario with partially overlapping features and samples. We first clarify how the data are partitioned in the hybrid FL scenario and propose a generic problem formulation. We then show that the proposed formulation covers a number of horizontal and personalized FL settings, and develop a BCD-based algorithm, HyFEM, to solve the proposed problem. Finally, our numerical results on a number of image classification datasets demonstrate that HyFEM enables clients with partial features and samples to achieve performance comparable to centralized training with full features.

A HETEROGENEOUS MODEL MATCHING ALGORITHM

In this section, we describe the details of the model matching algorithm. First, we explain the motivation behind the design of the classifiers' regularizer (5), which encourages the w_m's to be matched together to construct w_0. Then we present the detailed version of lines 8-14 of Algorithm 2 for optimizing the regularizer in (11).

A.1 REGULARIZER DESIGN

Recall the regularizer for the classifiers:

r_{m,2}(w_m, Π_m, w_0) = (1/2) ∥w_m − Π_m w_0∥², s.t. Π_m ∈ S(d_{m,w}, d_{0,w}),

where d_{m,w}, d_{0,w} are the dimensions of w_m, w_0, and Π_m is a selection matrix corresponding to the unknown pruning pattern. In this section, we motivate why such a regularizer is used and how the selection matrices Π_m are constructed. We note that the proposed matching method is a non-trivial extension of the neural matching method of Yurochkin et al. (2019) (designed for horizontal FL) to the case of hybrid FL. In Yurochkin et al. (2019), the authors considered the horizontal FL setting, where the sizes and functionalities of all the clients' models as well as the server's model are identical, so after matching, the clients can use the matched server-side model directly as their new model. However, in the considered hybrid FL setting, the input dimension of each client's inference block can be very different, and the server-side and client-side models do not share the exact same functionality. Therefore we cannot replace the client-side models with the server-side model. This special property of the hybrid FL problem introduces significant challenges for the matching procedure, and it is the main reason that, in our proposed algorithm, the matching matrices and the client/server models have to be iteratively optimized.

Suppose that for each client m, its classifier F_m(w_m; •) has L layers; then the inference block has the following structure:

y = σ_{m,L}(w_{m,L} · σ_{m,L−1}(w_{m,L−1} · · · σ_{m,1}(w_{m,1} v_m) · · · )),

where σ_{m,l}(•) represents the element-wise nonlinear activation function of layer l, the w_{m,l}'s are the weight matrices of layer l, and v_m is the input of the classifier, i.e., the stacked output of the feature extractors, v_m = H_m(Θ_m; x). Then w_m = {w_{m,l}}_{l=1}^L; see Figure 6 for an illustration. Let us similarly define the server's feature extractor output, activation functions, and weights as v_0, {σ_{0,l}(•)}_{l=1}^L, and {w_{0,l}}_{l=1}^L. Note that, given a selection matrix Π ∈ S of appropriate shape, left-multiplying the weight matrix w_{m,l} by Π^T (forming Π^T w_{m,l}) selects rows of w_{m,l}, which is equivalent to selecting the output neurons of the l-th layer; right-multiplying the weight matrix w_{m,l} by Π (forming w_{m,l} Π) selects columns of w_{m,l}, which is equivalent to selecting the input neurons of the l-th layer. The goal is to match the w_{m,l}'s with the corresponding parameters w_{0,l} at the server. Below, we discuss how the first layer, the middle layers, and the last layer are matched.

First, recall that the inputs of the classifiers satisfy

H_m(Θ_m; x_m) = P_m H_0(Θ_0; x_0), that is, v_m = P_m v_0,

where P_m is the selection matrix defined by the feature overlapping pattern between x_m and x_0. Multiplying both sides by P_m^T, we obtain P_m^T P_m v_0 = P_m^T v_m. Note that P_m^T v_m simply pads zeros in the missing feature indices of v_m, so that it matches the size of v_0. Let us define Π_{m,1} := P_m. Utilizing the fact that P_m ∈ S(d_m, d_0) is a selection matrix, it holds that P_m P_m^T = I_{d_m}, and we have the following relation:

σ_{m,1}((w_{m,1} Π_{m,1})(Π_{m,1}^T v_m)) = σ_{m,1}(w_{m,1} v_m).

This process expands the input v_m to the same size as v_0 while keeping the output of the first layer unchanged; see Fig. 7 for an illustration.
Next, we would like to find a selection matrix $\Pi_{m,2} \in \mathcal{S}$ that compresses the output of the first layer of the server to match the output of the first layer of client $m$, as follows:
$$\sigma_{m,1}\big((w_{m,1}\Pi_{m,1})(\Pi_{m,1}^T v_m)\big) \approx \Pi_{m,2}\,\sigma_{0,1}(w_{0,1} v_0). \tag{15}$$
This output matching relation imposes the following assumption on the model parameters:
$$w_{m,1} = \Pi_{m,2}\, w_{0,1}\, \Pi_{m,1}^T. \tag{16}$$
To see why (16) implies (15), we can plug (16) into the left-hand side of (15) and obtain:
$$\begin{aligned}
\sigma_{m,1}\big((w_{m,1}\Pi_{m,1})(\Pi_{m,1}^T v_m)\big)
&= \sigma_{m,1}\big((\Pi_{m,2} w_{0,1} \Pi_{m,1}^T \Pi_{m,1})(\Pi_{m,1}^T v_m)\big)\\
&\overset{(i)}{=} \Pi_{m,2}\,\sigma_{0,1}\big((w_{0,1}\Pi_{m,1}^T \Pi_{m,1})(\Pi_{m,1}^T v_m)\big)\\
&= \Pi_{m,2}\,\sigma_{0,1}\big((w_{0,1}\Pi_{m,1}^T \Pi_{m,1})(\Pi_{m,1}^T \Pi_{m,1} v_0)\big)\\
&= \Pi_{m,2}\,\sigma_{0,1}\big(w_{0,1}\Pi_{m,1}^T \Pi_{m,1} v_0\big)\\
&\overset{(ii)}{\approx} \Pi_{m,2}\,\sigma_{0,1}(w_{0,1} v_0),
\end{aligned}$$
where (i) follows because the projection only reorders and zero-pads the output, so applying the element-wise activation before or after the projection does not change the result; and (ii) uses the fact that $\Pi_{m,1}^T\Pi_{m,1}$ is a diagonal matrix with 0/1 diagonal entries, which can be approximated by the identity matrix. The above discussion suggests that if (16) holds approximately, then (15) holds approximately. As a result, we design the regularizer on the first layer between client $m$ and the server by approximately enforcing (16):
$$\tfrac{1}{2}\big\|w_{m,1} - \Pi_{m,2}\, w_{0,1}\, \Pi_{m,1}^T\big\|^2.$$
Let us now analyze the constraints on $\Pi_{m,2}$. First, since the dimension of $w_{0,1}$ is larger than or equal to that of $w_{m,1}$ for each client $m$, we require each coordinate of $w_{m,1}$ to be matched to one coordinate in $w_{0,1}$; therefore $\Pi_{m,2}$ must satisfy $\Pi_{m,2}\mathbf{1} = \mathbf{1}$. Further, each coordinate in $w_{0,1}$ should be matched to a coordinate of $w_{m,1}$ for at least one client $m \in [M]$, which means $\sum_{m=1}^M \mathbf{1}^T \Pi_{m,2} \geq \mathbf{1}^T$. The above process is illustrated in Fig. 8. For the $l$-th middle layer, its input is the output of the previous layer.
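The implication (16) $\Rightarrow$ (15) for the first layer can also be checked numerically. The sketch below uses toy shapes and a ReLU activation (an assumption for illustration; any element-wise $\sigma$ works), and verifies that once $v_0$ is masked by $\Pi_{m,1}^T\Pi_{m,1}$, the matching relation holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)     # element-wise activation (assumed)

d0, dm = 5, 3                           # server/client input widths (toy)
h0, hm = 6, 4                           # server/client first-layer widths (toy)

# Pi1 in S(dm, d0): the client input is a selection of server features.
Pi1 = np.eye(d0)[[0, 2, 4], :]
# Pi2 in S(hm, h0): client output neurons select server neurons.
Pi2 = np.eye(h0)[[1, 2, 5, 0], :]

w0 = rng.normal(size=(h0, d0))          # server first-layer weight
wm = Pi2 @ w0 @ Pi1.T                   # impose relation (16)

v0 = rng.normal(size=d0)
vm = Pi1 @ v0

lhs = relu((wm @ Pi1) @ (Pi1.T @ vm))   # client-side output, expanded
rhs = Pi2 @ relu(w0 @ (Pi1.T @ Pi1 @ v0))
# Exact equality: element-wise relu commutes with the 0/1 selection Pi2.
assert np.allclose(lhs, rhs)
```

The final approximation step (ii) of the derivation corresponds to dropping the mask `Pi1.T @ Pi1` on `v0`; everything before it is exact.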
By fixing the projection matrices $\{\Pi_{m,l}\}_{m=1}^M$ that match the output of the $(l-1)$-th layer at each client to the output of the $(l-1)$-th layer at the server, the matching problem for the $l$-th middle layer takes the same form as the matching problem for the input layer: the output of the previous layer, $\sigma_{m,l-1}(\cdot)$, plays the role of the input $v_m$; the projection matrices $\{\Pi_{m,l}\}_{m=1}^M$ play the role of the input projection matrices $\{\Pi_{m,1}\}_{m=1}^M$; and the goal is to find the projection matrices $\{\Pi_{m,l+1}\}_{m=1}^M$ that match the output of the $l$-th layer at each client to the output of the same layer at the server. By the same argument used for the input layer, we design the regularizer on the $l$-th middle layer between client $m$ and the server as
$$\tfrac{1}{2}\big\|w_{m,l} - \Pi_{m,l+1}\, w_{0,l}\, \Pi_{m,l}^T\big\|^2.$$
For the last layer, the output at each client is the same as at the server, namely the predicted label. Therefore, the output projection matrix for the last layer is the identity, and we design the regularizer for the last layer as
$$\tfrac{1}{2}\big\|w_{m,L} - w_{0,L}\, \Pi_{m,L}^T\big\|^2.$$
Next, by vectorizing the $w_{m,l}$'s, we have the following relation:
$$\mathrm{vec}\big(\Pi_{m,l+1}\, w_{0,l}\, \Pi_{m,l}^T\big) = (\Pi_{m,l} \otimes \Pi_{m,l+1})\,\mathrm{vec}(w_{0,l}),$$
where $\otimes$ denotes the Kronecker product. Therefore the regularizer for each layer can be rewritten as
$$\tfrac{1}{2}\big\|\mathrm{vec}(w_{m,l}) - (\Pi_{m,l} \otimes \Pi_{m,l+1})\,\mathrm{vec}(w_{0,l})\big\|^2.$$
Finally, we stack the sub-vectors $\{\mathrm{vec}(w_{m,l})\}_{l=1}^L$ into $w_m$ and define the projection matrix of the long vector as
$$\Pi_m := \mathrm{diag}\big((\Pi_{m,1} \otimes \Pi_{m,2}), \ldots, (\Pi_{m,L} \otimes I)\big).$$
$\Pi_m$ is a block-diagonal matrix, and it is easy to verify that it satisfies the following conditions:
$$\Pi_m \mathbf{1} = \mathbf{1}, \qquad \sum_{m=1}^M \mathbf{1}^T \Pi_m \geq \mathbf{1}^T, \qquad \Pi_m \geq 0.$$
Finally, we obtain the final formulation of the regularizer:
$$r_{m,2}(w_m, \Pi_m, w_0) = \tfrac{1}{2}\|w_m - \Pi_m w_0\|^2, \quad \forall\, m \in [M], \tag{19}$$
where the $\Pi_m$'s satisfy $\Pi_m \in \mathcal{S}(d_{w,m}, d_{w,0})$ and $\sum_{m=1}^M \mathbf{1}_{d_{w,m}}^T \Pi_m \geq \mathbf{1}_{d_{w,0}}^T$.

Algorithm 2 Sub-routines for Algorithm 1
1: ClientUpdate($\Theta_0^t, \Pi_m^t, w_0^t, Q, \eta$)
2:   Initialize: $\Theta_m^{t,0} \leftarrow P_m\Theta_0^t$, $w_m^{t,0} \leftarrow \Pi_m^t w_0^t$
3:   for $q = 0, \ldots, Q-1$ do
4:     Uniformly sample $n \in \mathcal{N}_m$
5:     $\Theta_m^{t,q+1} \leftarrow \Theta_m^{t,q} - \eta\big(\nabla_{\Theta_m}\ell(F_m(w_m^{t,q}; H_m(\Theta_m^{t,q}; x_{m,n})), y_n) + \mu_1(\Theta_m^{t,q} - P_m\Theta_0^t)\big)$
6:     $w_m^{t,q+1} \leftarrow w_m^{t,q} - \eta\big(\nabla_{w_m}\ell(F_m(w_m^{t,q}; H_m(\Theta_m^{t,q}; x_{m,n})), y_n) + \mu_2(w_m^{t,q} - \Pi_m^t w_0^t)\big)$
7:   Output: $\Theta_m^{t,Q}, w_m^{t,Q}$
8: ModelMatching($\{w_m^{t,Q}, \Pi_m^t\}_{m=1}^M, w_0^t, P$)
9:   for $p = 0, \ldots, P-1$ do
10:    for $m' = 1, \ldots, M$ in parallel do
11:      $\hat{w}_0^{t,p} \leftarrow \big(\sum_{m\neq m'} p_m (\Pi_m^{t,p})^T \Pi_m^{t,p}\big)^{-1} \sum_{m\neq m'} p_m (\Pi_m^{t,p})^T w_m^{t,Q}$
12:      $\Pi_{m'}^{t,p+1} \leftarrow \arg\min_{\Pi_{m'}} r_{m',2}(w_{m'}^{t,Q}, \Pi_{m'}, \hat{w}_0^{t,p})$  // using the Hungarian algorithm
13:  $w_0^{t+1} \leftarrow \big(\sum_{m=1}^M p_m (\Pi_m^{t,P})^T \Pi_m^{t,P}\big)^{-1} \sum_{m=1}^M p_m (\Pi_m^{t,P})^T w_m^{t,Q}$, $\quad \Pi_m^{t+1} \leftarrow \Pi_m^{t,P}$
14:  Output: $w_0^{t+1}, \{\Pi_m^{t+1}\}_{m=1}^M$

Algorithm 3 Model Matching Procedure
1: ModelMatching
2: Input: $\{w_m^{t,Q}, \Pi_m^t\}_{m=1}^M, w_0^t, P$
3: for $p = 0, \ldots, P-1$ do
4:   Uniformly sample $m' \in [M]$
5:   $w_0^{t,p} \leftarrow \big(\sum_{m\neq m'} p_m (\Pi_m^{t,p})^T \Pi_m^{t,p}\big)^{-1} \sum_{m\neq m'} p_m (\Pi_m^{t,p})^T w_m^{t,Q}$
6:   for $l = 1, \ldots, L-1$ do
7:     Construct the cost matrix $C_l$ with (20)
8:     $\Pi_{m',l}^{t,p+1} \leftarrow \arg\min_{\Pi_{m'}} \sum_{i,j} \Pi_{m'}(i,j)\cdot C_l(i,j)$  // using the Hungarian algorithm
9: $w_0^{t+1} \leftarrow \big(\sum_{m=1}^M p_m (\Pi_m^{t,P})^T \Pi_m^{t,P}\big)^{-1} \sum_{m=1}^M p_m (\Pi_m^{t,P})^T w_m^{t,Q}$, $\quad \Pi_m^{t+1} \leftarrow \Pi_m^{t,P}$
10: Output: $w_0^{t+1}, \{\Pi_m^{t+1}\}_{m=1}^M$

A.2 OPTIMIZATION PROCEDURE

In this subsection, we describe the detailed procedure in lines 8-14 of Algorithm 2 for optimizing the classifier matching problem (5), or its more detailed formulation (19). The procedure is given in Algorithm 3. We iteratively solve the matching problem (11) for $P$ iterations. In each iteration, we randomly pick a client $m'$ and match it with the server's classifier from the first layer to the last layer.
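The Kronecker-product vectorization identity used in the layer-wise regularizer above can be verified numerically. The sketch below uses toy selection matrices; note that it assumes the column-major convention for $\mathrm{vec}(\cdot)$, under which $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy selection matrices for layer l: input side Pi_l in S(3, 5),
# output side Pi_{l+1} in S(2, 4).
Pi_l  = np.eye(5)[[0, 2, 3], :]
Pi_l1 = np.eye(4)[[1, 3], :]

w0 = rng.normal(size=(4, 5))            # server weight of layer l

# vec() must be column-major ('F') for vec(AXB) = (B^T kron A) vec(X).
lhs = (Pi_l1 @ w0 @ Pi_l.T).flatten(order="F")
rhs = np.kron(Pi_l, Pi_l1) @ w0.flatten(order="F")
assert np.allclose(lhs, rhs)            # vec(Pi_{l+1} w0 Pi_l^T) = (Pi_l kron Pi_{l+1}) vec(w0)
```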
For the first $L-1$ layers, we first fix $\Pi_{m',l-1}$ and construct an assignment cost matrix $C_l$ whose entry $(i,j)$ is the cost of matching the $j$-th row at client $m'$ to the $i$-th row at the server in layer $l$. The element $C_l(i,j)$ is defined as:
$$C_l(i,j) = \begin{cases} \mathrm{dist}_1\big(w_{0,l}\Pi_{m,l-1}^T[i],\; w_{m,l}[j]\big), & w_{0,l}\Pi_{m,l-1}^T[i] \neq 0,\\[2pt] \mathrm{dist}_2\big(w_{m,l}[j]\big), & \text{otherwise}, \end{cases} \tag{20}$$
where $w_{0,l}\Pi_{m,l-1}^T[i]$ and $w_{m,l}[j]$ denote the $i$-th and $j$-th rows of the respective matrices, $\mathrm{dist}_1$ is the similarity cost for matching $w_{m,l}[j]$ to an existing row, and $\mathrm{dist}_2$ is the dimension penalty for matching $w_{m,l}[j]$ to a new row in $w_{0,l}$. One specification of the cost functions is PFNM (Yurochkin et al., 2019), which uses the MAP loss of the Beta-Bernoulli process: $\mathrm{dist}_1$ is based on the Gaussian prior and $\mathrm{dist}_2$ follows the Indian buffet process prior. We then solve the assignment problem to obtain $\Pi_{m',l}$ with the celebrated Hungarian algorithm (Kuhn, 1955). Note that for the first layer, the matching pattern $\Pi_{m,0}$ is given by $\Pi_{m,0} = P_m$, and we do not need to match the output layer.
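The per-layer assignment step (line 8 of Algorithm 3) can be sketched with SciPy's implementation of the Hungarian algorithm. The squared-Euclidean cost below is a simple stand-in for the PFNM $\mathrm{dist}_1/\mathrm{dist}_2$ costs, and all sizes are toy assumptions; the sketch plants a known row correspondence and checks that the solver recovers it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)

h0, hm = 5, 3                            # server / client layer widths (toy)
w0_rows = rng.normal(size=(h0, 4))       # rows of w_{0,l} Pi^T_{m,l-1}
# Client rows: noisy copies of server rows 3, 0, 4 (planted matching).
wm_rows = w0_rows[[3, 0, 4]] + 0.01 * rng.normal(size=(hm, 4))

# Stand-in cost: squared Euclidean distance between candidate row pairs
# (the paper's dist_1/dist_2 come from the PFNM Beta-Bernoulli MAP loss).
C = ((w0_rows[:, None, :] - wm_rows[None, :, :]) ** 2).sum(-1)  # (h0, hm)

# Hungarian algorithm: match each client row j to a distinct server row i.
rows, cols = linear_sum_assignment(C.T)  # rows: client idx, cols: server idx

# Build the 0/1 selection matrix Pi_{m,l}: Pi[j, i] = 1 iff j matched to i.
Pi = np.zeros((hm, h0))
Pi[rows, cols] = 1.0
assert (cols == np.array([3, 0, 4])).all()  # recovers the planted matching
```

In the actual algorithm the recovered `Pi` plays the role of $\Pi_{m',l}^{t,p+1}$, and the cost matrix is rebuilt from (20) at each layer.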

B CONVERGENCE ANALYSIS

In this section, we analyze the convergence properties of Algorithm 1. We first make the following assumptions on the problem:

A1 (Block Lipschitz Gradient) For the parameter blocks in $\{w_m, \Theta_m\}_{m=1}^M$, there exists an $L_m$ such that
$$\|\nabla_{\Theta_m} f_m(\Theta_m, w_m) - \nabla_{\Theta_m} f_m(\Theta_m', w_m')\| + \|\nabla_{w_m} f_m(\Theta_m, w_m) - \nabla_{w_m} f_m(\Theta_m', w_m')\| \leq L_m\big(\|\Theta_m - \Theta_m'\| + \|w_m - w_m'\|\big), \quad \forall\, \Theta_m, \Theta_m', w_m, w_m'.$$

A2 (Lower Bounded Loss) Each client's classification loss is bounded from below, i.e., there exist finite $\underline{f}_m > -\infty$ such that $f_m(\Theta_m, w_m) \geq \underline{f}_m$ for all $\Theta_m, w_m, m$.

A3 (Bounded Variance) The stochastic partial gradient estimates have bounded variances $\sigma_\Theta^2$ and $\sigma_w^2$, i.e., for all $\Theta_m, w_m$ and all $m \in [M]$,
$$\mathbb{E}_n\|\nabla_{\Theta_m}\ell(F_m(w_m; H_m(\Theta_m; x_{m,n})), y_n) - \nabla_{\Theta_m} f_m(w_m, \Theta_m)\|^2 \leq \sigma_\Theta^2,$$
$$\mathbb{E}_n\|\nabla_{w_m}\ell(F_m(w_m; H_m(\Theta_m; x_{m,n})), y_n) - \nabla_{w_m} f_m(w_m, \Theta_m)\|^2 \leq \sigma_w^2.$$

We can abstract HyFEM as a BCD-type algorithm by redefining the model parameters and the problem as follows:

1. Define $x := [\Theta_1; \ldots; \Theta_M; w_1; \ldots; w_M]$, $y := [\mathrm{vec}(\Pi_1); \ldots; \mathrm{vec}(\Pi_M)]$, and $z := [\Theta_0; w_0]$.

2. Define

$$\mathcal{L}(x, y, z) := \sum_{m=1}^M p_m\big(f_m(\Theta_m, w_m) + \mu_1\, r_{m,1}(\Theta_m, \Theta_0) + \mu_2\, r_{m,2}(w_m, \Pi_m, w_0)\big).$$
Then the optimization problem (6) can be written compactly as:
$$\min_{x,y,z}\ \mathcal{L}(x, y, z), \quad \text{s.t. } y \in \mathrm{Range}(y). \tag{21}$$
Moreover, the algorithm can be simplified as:
$$x^{t,q+1} = x^{t,q} - \eta\,\widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t), \quad \text{for } q = 0, \ldots, Q-1, \tag{22a}$$
$$y^{t+1} = \arg\min_{y \in \mathrm{Range}(y)} \mathcal{L}(x^{t,Q}, y, z^t), \tag{22b}$$
$$z^{t+1} = \arg\min_{z} \mathcal{L}(x^{t,Q}, y^{t+1}, z), \tag{22c}$$
where we let $x^{t+1,0} = x^{t+1} = x^{t,Q}$, and $\widetilde{\nabla}_x \mathcal{L}(\cdot)$ denotes the stochastic partial gradient w.r.t. $x$. We make the following assumptions on problem (21).

A4 (Block Lipschitz Gradient) $\mathcal{L}$ is block smooth: for the blocks $x$ and $z$, there exist positive constants $L_x$, $L_z$, and $C_x$ such that
$$\|\nabla_x \mathcal{L}(x, y, z) - \nabla_x \mathcal{L}(x', y, z)\| \leq L_x\|x - x'\|, \quad \forall\, x, x', z,\ \forall y \in \mathrm{Range}(y),$$
$$\|\nabla_x \mathcal{L}(x, y, z) - \nabla_x \mathcal{L}(x, y, z')\| \leq C_x\|z - z'\|, \quad \forall\, z, z', x,\ \forall y \in \mathrm{Range}(y).$$

A5 (Block Strong Convexity of $z$) For the block $z$, there exists a positive constant $\mu$ such that
$$\mathcal{L}(x, y, z') \geq \mathcal{L}(x, y, z) + \langle\nabla_z \mathcal{L}(x, y, z), z' - z\rangle + \tfrac{\mu}{2}\|z' - z\|^2, \quad \forall\, x, z, z',\ \forall y \in \mathrm{Range}(y).$$

A6 (Unbiased Stochastic Partial Gradient) The stochastic partial gradient w.r.t. $x$ is unbiased:
$$\mathbb{E}\,\widetilde{\nabla}_x \mathcal{L}(x, y, z) = \nabla_x \mathcal{L}(x, y, z), \quad \forall\, x, z,\ \forall y \in \mathrm{Range}(y).$$

A7 (Bounded Variance of Stochastic Partial Gradient) The stochastic partial gradient w.r.t. $x$ has bounded variance $\sigma^2$:
$$\mathbb{E}\big\|\widetilde{\nabla}_x \mathcal{L}(x, y, z) - \nabla_x \mathcal{L}(x, y, z)\big\|^2 \leq \sigma^2, \quad \forall\, x, z,\ \forall y \in \mathrm{Range}(y).$$

A8 (Lower Bounded Function) $\mathcal{L}$ is bounded from below, i.e., $\exists\, \underline{\mathcal{L}} > -\infty$ such that $\mathcal{L}(x, y, z) \geq \underline{\mathcal{L}}$ for all $x, z$ and all $y \in \mathrm{Range}(y)$.

A9 (Compact Constraint Set) For the block $y$, the constraint set $\mathrm{Range}(y)$ is compact.

Note that in A4 and A5 we only assume that the blocks $x$ and $z$ are smooth, and only the block $z$ is strongly convex, while the block $y$ can be non-smooth and non-convex, and $x$ can potentially be non-convex. Further, we assume that $\nabla_x \mathcal{L}$ is Lipschitz w.r.t. $z$, which is non-standard, but we can prove that it holds for problem (6).
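The three-block update pattern (22a)-(22c) can be illustrated on a toy objective. The sketch below is not the paper's model: it uses a hypothetical loss $\mathcal{L}(x,y,z) = \tfrac12\|x - P_y z\|^2 + \tfrac{\mu}{2}\|z\|^2$, where $y$ indexes a small finite set of permutation matrices (playing the role of the compact matching block) and $z$ has a closed-form minimizer, so all three updates are easy to write down.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy instance of scheme (22): L(x, y, z) = 0.5*||x - P_y z||^2 + 0.5*mu*||z||^2.
# y indexes a finite set of permutations (compact "matching" block); z is
# strongly convex with a closed-form argmin. Illustrative only.
perms = [np.eye(3), np.eye(3)[[1, 2, 0]], np.eye(3)[[2, 0, 1]]]
mu, eta, Q, T = 0.1, 0.2, 5, 50

def loss(x, y, z):
    return 0.5 * np.sum((x - perms[y] @ z) ** 2) + 0.5 * mu * np.sum(z ** 2)

x = rng.normal(size=3)
y, z = 0, np.zeros(3)
for t in range(T):
    for q in range(Q):                        # (22a): Q stochastic grad steps on x
        noise = 0.01 * rng.normal(size=3)     # stands in for minibatch noise
        x -= eta * ((x - perms[y] @ z) + noise)
    y = min(range(len(perms)),                # (22b): exact min over the finite y-set
            key=lambda k: loss(x, k, z))
    z = perms[y].T @ x / (1.0 + mu)           # (22c): closed-form exact min over z
```

The $z$-update follows from $\nabla_z \mathcal{L} = -P_y^T(x - P_y z) + \mu z = 0$ and $P_y^T P_y = I$; after a few outer rounds the loss settles near the stochastic-noise floor.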
The remaining assumptions A6-A9 are common when analyzing stochastic algorithms. Further, we can verify that the above assumptions hold for the original problem (6).

Lemma 1 Suppose (6) satisfies assumptions A1-A3. Then it satisfies A4-A9 with the constants
$$L_x = \max_m\{p_m L_m + \max\{\mu_1, \mu_2\}\},\quad C_x = \max\{\mu_1, \mu_2\},\quad \mu \geq \min_m\{p_m\}\cdot\min\{\mu_1, \mu_2\},\quad \sigma^2 = \sigma_\Theta^2 + \sigma_w^2,\quad \underline{\mathcal{L}} = \sum_{m=1}^M p_m\underline{f}_m.$$
The proof is given in Section B.2. Then we have the following result:

Theorem 2 Suppose problem (21) satisfies A4-A9, and run (22) for $T$ iterations with stepsize $\eta \leq \min\{\tfrac{1}{L_x}, \tfrac{8\mu}{5C_x^2}\}$. Then the sequence $\{x^{t,q}, y^t, z^t\}_{t=0}^T$ generated by (22) satisfies:
$$\frac{1}{TQ}\sum_{t=0}^{T-1}\mathbb{E}\Big[\frac{\mu}{\eta}\|z^{t+1} - z^t\|^2 + \|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\|^2 + \sum_{q=0}^{Q}\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2\Big] \leq \frac{10}{TQ\eta}\big(\mathcal{L}(x^0, y^0, z^0) - \underline{\mathcal{L}}\big) + \Big(5L_x\eta + \frac{2L_x^2\eta^2}{Q}\Big)\sigma^2, \tag{23}$$
and $\|\nabla_z \mathcal{L}(x^t, y^t, z^t)\|^2 = 0$ for all $t \in [T]$.

This result indicates that by setting $Q = T$ and $\eta = \sqrt{\frac{2(\mathcal{L}(x^0,y^0,z^0) - \underline{\mathcal{L}})}{L_x QT\sigma^2}}$, the right-hand side (RHS) of (23) becomes
$$\frac{10\sigma\sqrt{2(\mathcal{L}(x^0,y^0,z^0) - \underline{\mathcal{L}})L_x}}{T} + \frac{2L_x(\mathcal{L}(x^0,y^0,z^0) - \underline{\mathcal{L}})}{T^3} = O\Big(\frac{1}{T}\Big).$$
Let us analyze the terms on the left-hand side (LHS) of (23). First, we have
$$\frac{1}{TQ}\sum_{t=0}^{T-1}\mathbb{E}\,\frac{\mu}{\eta}\|z^{t+1} - z^t\|^2 = \frac{\mu}{Q\eta}\cdot\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|z^{t+1} - z^t\|^2 = O\Big(\frac{1}{T}\Big),$$
and since $Q\eta = O(1)$ under the above choice, this indicates $\mathbb{E}\|z^{t+1} - z^t\|^2 = O(\tfrac{1}{T})$. Second, we have
$$\frac{1}{TQ}\sum_{t=0}^{T-1}\mathbb{E}\Big[\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\|^2 + \sum_{q=0}^{Q}\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2\Big] = O\Big(\frac{1}{T}\Big),$$
where the LHS is a sum of $T(Q+2)$ terms of the form $\|\nabla_x \mathcal{L}\|^2$ divided by $TQ$, which indicates that $\mathbb{E}\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 = O(\tfrac{1}{T})$. Together, algorithm (22) finds a stationary solution of (21) w.r.t. $x, z$ at rate $O(\tfrac{1}{T})$. Combining Theorem 2 with Lemma 1, we conclude that by running Algorithm 1, the parameters $\{w_m, \Theta_m\}_{m=0}^M$ converge to a stationary point of (6), while the $\{\Pi_m\}_{m=1}^M$ stay in a compact set.
Alternatively, if we assume that the solution for $y$ is unique and the update of $x$ is a one-step gradient descent, i.e., $Q = 1$ and $x^{t+1} = x^t - \eta\nabla_x \mathcal{L}(x^t, y^t, z^t)$, then by applying (Razaviyayn et al., 2013, Theorem 2), Algorithm 1 asymptotically converges to a first-order stationary point of (6).

B.1 PROOF FOR THEOREM 2

We begin by proving the following descent result:
$$\mathbb{E}_t\,\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) - \mathcal{L}(x^t, y^t, z^t) \leq -\frac{\eta}{2}\sum_{q=0}^{Q-1}\mathbb{E}_t\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 - \frac{\mu}{2}\|z^{t+1} - z^t\|^2 + \frac{QL_x\eta^2\sigma^2}{2}, \tag{24}$$
where $\mathbb{E}_t$ denotes the expectation conditioned on the information up to iteration $t$. First, we split the LHS into three terms:
$$\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) - \mathcal{L}(x^t, y^t, z^t) = \big[\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) - \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\big] + \big[\mathcal{L}(x^{t+1}, y^{t+1}, z^t) - \mathcal{L}(x^{t+1}, y^t, z^t)\big] + \big[\mathcal{L}(x^{t+1}, y^t, z^t) - \mathcal{L}(x^t, y^t, z^t)\big]. \tag{25}$$
We bound the three terms on the RHS separately.

1) The first term can be bounded by applying A5:
$$\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) - \mathcal{L}(x^{t+1}, y^{t+1}, z^t) \overset{A5}{\leq} -\langle\nabla_z \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}), z^t - z^{t+1}\rangle - \frac{\mu}{2}\|z^{t+1} - z^t\|^2 \overset{(a)}{=} -\frac{\mu}{2}\|z^{t+1} - z^t\|^2, \tag{26}$$
where (a) uses the update rule (22c): by exact minimization, $\nabla_z \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) = 0$.

2) By the update rule (22b), the second term satisfies
$$\mathcal{L}(x^{t+1}, y^{t+1}, z^t) - \mathcal{L}(x^{t+1}, y^t, z^t) \leq 0. \tag{27}$$

3) The third term can be further decomposed as:
$$\mathcal{L}(x^{t+1}, y^t, z^t) - \mathcal{L}(x^t, y^t, z^t) = \mathcal{L}(x^{t,Q}, y^t, z^t) - \mathcal{L}(x^{t,0}, y^t, z^t) = \sum_{q=0}^{Q-1}\big[\mathcal{L}(x^{t,q+1}, y^t, z^t) - \mathcal{L}(x^{t,q}, y^t, z^t)\big], \tag{28}$$
where the first equality uses the convention $x^{t,Q} = x^{t+1}$ and $x^{t,0} = x^t$. Each term in the summation is bounded as:
$$\mathcal{L}(x^{t,q+1}, y^t, z^t) - \mathcal{L}(x^{t,q}, y^t, z^t) \overset{A4}{\leq} \langle\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t), x^{t,q+1} - x^{t,q}\rangle + \frac{L_x}{2}\|x^{t,q+1} - x^{t,q}\|^2 \overset{(22a)}{=} -\eta\langle\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t), \widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t)\rangle + \frac{L_x\eta^2}{2}\|\widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2. \tag{29}$$
Taking expectation over the sample drawn at $(t,q)$, we have:
$$\begin{aligned}
\mathbb{E}_{t,q}\big[\mathcal{L}(x^{t,q+1}, y^t, z^t) - \mathcal{L}(x^{t,q}, y^t, z^t)\big] &\leq -\eta\big\langle\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t), \mathbb{E}_{t,q}\widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t)\big\rangle + \frac{L_x\eta^2}{2}\mathbb{E}_{t,q}\|\widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2\\
&\overset{(a)}{=} -\eta\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 + \frac{L_x\eta^2}{2}\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 + \frac{L_x\eta^2}{2}\mathbb{E}_{t,q}\|\widetilde{\nabla}_x \mathcal{L}(x^{t,q}, y^t, z^t) - \nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2\\
&\overset{A7}{\leq} -\Big(\eta - \frac{L_x\eta^2}{2}\Big)\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 + \frac{L_x\eta^2\sigma^2}{2},
\end{aligned} \tag{30}$$
where (a) first applies the identity $\mathbb{E}\|X\|^2 = \|\mathbb{E}X\|^2 + \mathbb{E}\|X - \mathbb{E}X\|^2$ to the second term, and then applies A6 to the first and second terms. Picking $\eta \leq \frac{1}{L_x}$ (so that $\eta - \frac{L_x\eta^2}{2} \geq \frac{\eta}{2}$) and substituting (30) into (28), we have:
$$\mathbb{E}_t\big[\mathcal{L}(x^{t+1}, y^t, z^t) - \mathcal{L}(x^t, y^t, z^t)\big] \leq -\frac{\eta}{2}\sum_{q=0}^{Q-1}\mathbb{E}_t\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 + \frac{QL_x\eta^2\sigma^2}{2}. \tag{31}$$
Substituting (26), (27), and (31) back into (25), we obtain (24).

To prove Theorem 2, we further need to bound $\|\nabla_x \mathcal{L}(x^{t+1}, y^t, z^t)\|^2$ and $\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\|^2$. The first term can be bounded as:
$$\begin{aligned}
\mathbb{E}_{t,Q-1}\|\nabla_x \mathcal{L}(x^{t+1}, y^t, z^t)\|^2 &\overset{(a)}{\leq} 2\|\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2 + 2\,\mathbb{E}_{t,Q-1}\|\nabla_x \mathcal{L}(x^{t,Q}, y^t, z^t) - \nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2\\
&\overset{A4}{\leq} 2L_x^2\,\mathbb{E}_{t,Q-1}\|x^{t,Q} - x^{t,Q-1}\|^2 + 2\|\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2\\
&\overset{(22a)}{=} 2L_x^2\eta^2\,\mathbb{E}_{t,Q-1}\|\widetilde{\nabla}_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2 + 2\|\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2\\
&\overset{A7}{\leq} \big(2 + 2L_x^2\eta^2\big)\|\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2 + 2L_x^2\eta^2\sigma^2,
\end{aligned} \tag{32}$$
where in (a) we add and subtract $\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)$ and apply the Cauchy-Schwarz inequality. Similarly, the second term can be bounded as:
$$\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\|^2 \overset{(a)}{\leq} 2\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t) - \nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\|^2 + 2\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\|^2 \overset{A4}{\leq} 2C_x^2\|z^{t+1} - z^t\|^2 + 2\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\|^2, \tag{33}$$
where in (a) we add and subtract $\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})$ and apply the Cauchy-Schwarz inequality.
We then sum the above results as $(24)\times 2 + (32)\times\frac{\eta}{5} + (33)\times\frac{\eta}{5}$ and obtain:
$$\begin{aligned}
&2\,\mathbb{E}_t\,\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1}) - 2\mathcal{L}(x^t, y^t, z^t) + \frac{\eta}{5}\mathbb{E}_{t,Q-1}\|\nabla_x \mathcal{L}(x^{t+1}, y^t, z^t)\|^2 + \frac{\eta}{5}\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^t)\|^2\\
&\leq -\eta\sum_{q=0}^{Q-1}\mathbb{E}_t\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 - \mu\|z^{t+1} - z^t\|^2 + QL_x\eta^2\sigma^2 + \frac{2C_x^2\eta}{5}\|z^{t+1} - z^t\|^2\\
&\quad + \frac{2\eta}{5}\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\|^2 + \frac{2\eta + 2L_x^2\eta^3}{5}\|\nabla_x \mathcal{L}(x^{t,Q-1}, y^t, z^t)\|^2 + \frac{2L_x^2\eta^3\sigma^2}{5}. \tag{34}
\end{aligned}$$
Rearranging the terms, and noting that $\eta \leq \frac{1}{L_x}$ implies $2\eta + 2L_x^2\eta^3 \leq 4\eta$, we have:
$$\frac{\eta}{5}\sum_{q=0}^{Q}\mathbb{E}_t\|\nabla_x \mathcal{L}(x^{t,q}, y^t, z^t)\|^2 + \frac{\eta}{5}\mathbb{E}_t\|\nabla_x \mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\|^2 + \Big(\mu - \frac{2C_x^2\eta}{5}\Big)\|z^{t+1} - z^t\|^2 \leq 2\big(\mathcal{L}(x^t, y^t, z^t) - \mathbb{E}_t\,\mathcal{L}(x^{t+1}, y^{t+1}, z^{t+1})\big) + \Big(\frac{2L_x^2\eta^3}{5} + QL_x\eta^2\Big)\sigma^2. \tag{35}$$
Summing the above from $t = 0$ to $T-1$, choosing $\mu - \frac{2C_x^2\eta}{5} \geq \frac{\mu}{5}$ (which holds when $\eta \leq \frac{8\mu}{5C_x^2}$), and dividing both sides by $\frac{\eta QT}{5}$, Theorem 2 is proved.

B.2 PROOF FOR LEMMA 1

In this section, we verify assumptions A4-A9 for the original problem (6) under assumptions A1-A3. Recall the correspondence:
$$x := [\Theta_1; \ldots; \Theta_M; w_1; \ldots; w_M], \quad y := [\mathrm{vec}(\Pi_1); \ldots; \mathrm{vec}(\Pi_M)], \quad z := [\Theta_0; w_0],$$
$$\mathcal{L}(x, y, z) := \sum_{m=1}^M p_m\big(f_m(\Theta_m, w_m) + \mu_1\, r_{m,1}(\Theta_m, \Theta_0) + \mu_2\, r_{m,2}(w_m, \Pi_m, w_0)\big),$$
$$r_{m,1}(\Theta_m, \Theta_0) = \tfrac{1}{2}\|\Theta_m - P_m\Theta_0\|^2, \quad r_{m,2}(w_m, \Pi_m, w_0) = \tfrac{1}{2}\|w_m - \Pi_m w_0\|^2,$$
$$\text{s.t. } \Pi_m \in \mathcal{S}(d_{w,m}, d_{w,0}), \quad \sum_{m=1}^M \mathbf{1}_{d_{w,m}}^T \Pi_m \geq \mathbf{1}_{d_{w,0}}^T.$$

1) For A4, we have
$$\nabla_x \mathcal{L}(x, y, z) = \begin{bmatrix} p_m\nabla_{\Theta_m} f_m(\Theta_m, w_m) + p_m\mu_1(\Theta_m - P_m\Theta_0)\\ p_m\nabla_{w_m} f_m(\Theta_m, w_m) + p_m\mu_2(w_m - \Pi_m w_0) \end{bmatrix}_{m=1}^M.$$
Therefore we have the following bound:
$$\begin{aligned}
\|\nabla_x \mathcal{L}(x, y, z) - \nabla_x \mathcal{L}(x', y, z)\| &\leq \sum_{m=1}^M p_m\|\nabla_{\Theta_m} f_m(\Theta_m, w_m) + \mu_1\Theta_m - \nabla_{\Theta_m} f_m(\Theta_m', w_m') - \mu_1\Theta_m'\|\\
&\quad + \sum_{m=1}^M p_m\|\nabla_{w_m} f_m(\Theta_m, w_m) + \mu_2 w_m - \nabla_{w_m} f_m(\Theta_m', w_m') - \mu_2 w_m'\|\\
&\leq \sum_{m=1}^M p_m\big(\|\nabla_{\Theta_m} f_m(\Theta_m, w_m) - \nabla_{\Theta_m} f_m(\Theta_m', w_m')\| + \mu_1\|\Theta_m - \Theta_m'\|\big)\\
&\quad + \sum_{m=1}^M p_m\big(\|\nabla_{w_m} f_m(\Theta_m, w_m) - \nabla_{w_m} f_m(\Theta_m', w_m')\| + \mu_2\|w_m - w_m'\|\big)\\
&\overset{A1}{\leq} \sum_{m=1}^M p_m\big((L_m + \mu_1)\|\Theta_m - \Theta_m'\| + (L_m + \mu_2)\|w_m - w_m'\|\big)\\
&\leq \max_m\{p_m L_m + \max\{\mu_1, \mu_2\}\}\sum_{m=1}^M\big(\|\Theta_m - \Theta_m'\| + \|w_m - w_m'\|\big) = L_x\|x - x'\|,
\end{aligned}$$
where we obtain $L_x = \max_m\{p_m L_m + \max\{\mu_1, \mu_2\}\}$. Also, we have
$$\begin{aligned}
\|\nabla_x \mathcal{L}(x, y, z) - \nabla_x \mathcal{L}(x, y, z')\| &= \sum_{m=1}^M p_m\big(\mu_1\|P_m(\Theta_0 - \Theta_0')\| + \mu_2\|\Pi_m(w_0 - w_0')\|\big)\\
&\leq \sum_{m=1}^M p_m\big(\mu_1\|P_m\|\|\Theta_0 - \Theta_0'\| + \mu_2\|\Pi_m\|\|w_0 - w_0'\|\big)\\
&\overset{(a)}{=} \sum_{m=1}^M p_m\big(\mu_1\|\Theta_0 - \Theta_0'\| + \mu_2\|w_0 - w_0'\|\big)\\
&\overset{(b)}{\leq} \max\{\mu_1, \mu_2\}\big(\|\Theta_0 - \Theta_0'\| + \|w_0 - w_0'\|\big) = C_x\|z - z'\|.
\end{aligned}$$

2) Next, we verify A5 by directly computing the second derivative w.r.t. $z$:
$$\nabla_z^2 \mathcal{L}(x, y, z) = \sum_{m=1}^M p_m\begin{bmatrix}\nabla_{\Theta_0}^2(\mu_1 r_{m,1} + \mu_2 r_{m,2}) & \nabla_{\Theta_0}\nabla_{w_0}(\mu_1 r_{m,1} + \mu_2 r_{m,2})\\ \nabla_{w_0}\nabla_{\Theta_0}(\mu_1 r_{m,1} + \mu_2 r_{m,2}) & \nabla_{w_0}^2(\mu_1 r_{m,1} + \mu_2 r_{m,2})\end{bmatrix} = \begin{bmatrix}\mu_1\sum_{m=1}^M p_m P_m^T P_m & 0\\ 0 & \mu_2\sum_{m=1}^M p_m \Pi_m^T \Pi_m\end{bmatrix}.$$
We then analyze the range of the eigenvalues of this matrix. First, since the $\Pi_m, P_m$'s are selection matrices, $P_m^T P_m$ and $\Pi_m^T \Pi_m$ are diagonal matrices, so $\nabla_z^2 \mathcal{L}(x, y, z)$ is also a diagonal matrix. For the first block $\mu_1\sum_{m=1}^M p_m P_m^T P_m$: the $P_m$'s are the feature selection matrices, i.e., $x_m = P_m x_0$ with $x_m \in \mathcal{X}_m = \prod_{i\in I_m}\mathcal{D}_i$ and $x_0 \in \mathcal{X}_0 = \prod_{i=1}^{d_0}\mathcal{D}_i$. It is clear that if client $m$ has the $i$-th feature, then the $i$-th diagonal entry of $P_m^T P_m$ is $1$, and it is $0$ otherwise. That is,
$$P_m^T P_m(i,i) = \begin{cases}1, & i \in I_m,\\ 0, & i \notin I_m.\end{cases}$$
Further, the full feature space $\mathcal{X}_0$ is the union of the clients' feature spaces, i.e., At this point, we have verified A4-A9 for problem (6) with Algorithm 1, and the corresponding constants are summarized below:
$$L_x = \max_m\{p_m L_m + \max\{\mu_1, \mu_2\}\},\quad C_x = \max\{\mu_1, \mu_2\},\quad \mu \geq \min_m\{p_m\}\cdot\min\{\mu_1, \mu_2\},\quad \sigma^2 = \sigma_\Theta^2 + \sigma_w^2,\quad \underline{\mathcal{L}} = \sum_{m=1}^M p_m\underline{f}_m.$$
This completes the proof of Lemma 1.

C ADDITIONAL NUMERICAL EXPERIMENTS

In this section, we include additional sets of numerical experiments. In the first set, we reduce the feature heterogeneity of the data on the clients by allowing the clients to share common features, so that HFL algorithms such as FedProx and FedAvg apply. We also include an additional multi-modal dataset with both image and text features.

C.1 COMPARISON WITH HFL

In this section, we conduct numerical experiments to compare FedProx (Li et al., 2018) with HyFEM. In the experiments, we split the features into $d_0 = 4$ blocks and assign the first and third blocks as the common blocks for all $M = 6$ clients. We can then apply FedProx to train a model on the overlapping features and compare it with the models trained by HyFEM, which uses more features. The detailed data assignment patterns for the different datasets are described in Tables 3-4. Note that in the MultiView40 dataset, 6.88% of the data is never used by HyFEM and 50% of the data is never used by FedProx. For the Cifar-10 and EuroSAT datasets, all data is used by at least one client under HyFEM, while 50% of the data is dropped by FedProx. For the Cifar-10 and EuroSAT datasets, we split each image into $d_0 = 4$ feature blocks (top left, top right, bottom left, bottom right). For the MultiView40 dataset, we choose four views of the objects from different angles as the full feature space. The total number of communication rounds $T = 64$ and the number of local updates $Q = 32$ are fixed for all experiments. We performed a line search on the learning rate $\eta$ and on $\mu_2$ to obtain the best performance for each algorithm. The results are shown in Figure 9 and Figure 10. From the results, we can see that the models trained with HyFEM obtain better performance than FedProx. This is because HyFEM is able to use more data than FedProx through its heterogeneous models.

C.3 EXPERIMENTS ON HERIGRAPH DATASET

The HeriGraph dataset (Bai et al., 2022) is a multi-modal dataset for heritage site classification. It consists of $N_0 = 41{,}621$ samples from 9 classes. Each sample has at most $d_0 = 4$ preprocessed feature blocks: one text feature block and three different image feature blocks. Note that not all samples have all features; for example, only 25,325 samples have text features. In the experiments, we use an MLP with one hidden layer as the classifier $f_m(w_m; \cdot)$, and MLPs of different sizes as the feature extractors for each feature block. We set the number of clients to $M = 6$; each client has $d_m = 2$ feature blocks and 6 out of the 9 classes.



Figure 1: The heterogeneous data distribution in a medical diagnosis example.

Figure 2: The data distribution patterns of a) heterogeneous client data; b) HFL; and c) VFL. Medical diagnosis application (see Figure 1). The clients are clinics, and they collect data samples from patients. Each clinic may have a different set of diagnostic devices; e.g., clinic A has MRI and ultrasound, while clinic B has MRI and electrocardiographs (ECG). FH arises because the feature set of each sample collected by clinic A may only partially overlap with that collected by clinic B. Besides FH, SH also arises because multiple clinics may never treat the same patient, and each patient usually visits only a subset of clinics.

Due to the nature of hybrid FL, we must carefully set up its notation. We denote the all-one (column) vector of length $d$ by $\mathbf{1}_d$, the identity matrix of size $d$ by $I_d$, and the set of positive integers $\{1, 2, \ldots, N\}$ by $[N]$. Feature selection below uses a selector matrix of dimension $d_1 \times d_2$, which belongs to the following set:

Figure 3: The partitioned data and notations, and the structure of the client and server models with heterogeneous feature extractors and classifiers.

$$\mathcal{S}(d_1, d_2) := \big\{P \,\big|\, P \in \{0,1\}^{d_1\times d_2},\ P\mathbf{1}_{d_2} = \mathbf{1}_{d_1},\ P^T\mathbf{1}_{d_1} \leq \mathbf{1}_{d_2}\big\},
$$
i.e., each row of $P$ selects exactly one coordinate, and each coordinate is selected at most once (so that $PP^T = I_{d_1}$).

x. □ Theorem 1 (Informal) Suppose that for each $m \in [M]$, $f_m$ has Lipschitz continuous gradients w.r.t. $[\Theta_m, w_m]$, and that $w_0$ has a fixed dimension. Then with stepsize $\eta = O(1/\sqrt{QT})$ and client update number $Q = O(T)$, by running Algorithm 1, the expected squared gradient norm w.r.t. $\{\Theta_m, w_m\}_{m=1}^M$ converges at rate $O(1/T)$, and the successive update difference $w_0^{t+1}$

(Wu et al., 2015): ModelNet40 is a multiview object classification dataset that provides 12 views from different angles as 12 feature blocks for each object. The dataset has $N_0 = 40{,}000$ samples from 40 classes. Cifar-10 (Krizhevsky, 2009): Cifar-10 is an image classification dataset with $N = 50{,}000$ samples from 10 classes. We manually split each image into (top left, top right, bottom left, bottom right) $\times$ (red, green, blue) blocks, resulting in $d_0 = 12$ feature blocks in total.

Figure 4: Test accuracy of the server model trained with HyFEM compared with centralized training and stand-alone training for a) ModelNet40:1, b) ModelNet40:2, c) Cifar-10, and d) EuroSAT datasets. Baselines for comparison: Centralized training: we train a full-sized server model with all the data; this serves as the performance upper bound for all trained models. Stand-alone training: each client trains a client model only on its local data, without any communication; this serves as the baseline (and the performance lower bound) for HyFEM. In all settings, we fix the total number of updates ($T \cdot Q = 4096$, with $T = 128$, $Q = 32$) for a fair comparison, and tune the learning rate separately for each experiment to achieve the best performance.

Figure 5: Averaged test accuracy of all clients trained with HyFEM compared with centralized training and stand-alone training for a) ModelNet40:1, b) ModelNet40:2, c) Cifar-10, and d) EuroSAT datasets.

Figure 6: Illustration of the layer structure of the inference blocks on the clients, for the L = 3 case.

Figure 7: Aligning the input of the first layer by rearranging and padding the corresponding coordinates of the input v m and the first layer w m,1 .

Figure 8: Aligning the output of the first layer by rearranging and padding the corresponding coordinates of the first layer $w_{m,1}$.

where in (a) we use the fact that $P_m \in \mathcal{S}(d_m, d_0)$ and $\Pi_m \in \mathcal{S}(d_{w,m}, d_{w,0})$ are selection matrices, so that $\|P_m\| = 1$ and $\|\Pi_m\| = 1$; and (b) uses the fact that $\sum_{m=1}^M p_m = 1$. Therefore A4 is verified.

m∈[M ] I m = [d 0 ]. Therefore, we have Hessien matrix ∇ 2 z L(x, y, z) is positive definite, with smallest eigenvalue µ ≥ min m {p m } • min{µ 1 , µ 2 }. Thus A5 is verified.3) A6 holds true as in Algorithm 1, we uniformly samples n ∈ N m for all m ∈ [M ], thereforeE n ∇ Θm ℓ(F m (w m ; H m (Θ m ; x m,n )), y n ) = ∇ Θm f m (Θ m , w m ), E n ∇ wm ℓ(F m (w m ; H m (Θ m ; x m,n )), y n ) = ∇ wm f m (Θ m , w m ),Further, from A3, we can obtain A7 with σ 2 = σ 2 Θ + σ 2 w .Client index m Assigned features I m Assigned class

4) To verify A8, we apply A2:
$$\mathcal{L}(x, y, z) = \sum_{m=1}^M p_m\big(f_m(\Theta_m, w_m) + \mu_1\, r_{m,1}(\Theta_m, \Theta_0) + \mu_2\, r_{m,2}(w_m, \Pi_m, w_0)\big) \overset{(a)}{\geq} \sum_{m=1}^M p_m f_m(\Theta_m, w_m) \geq \sum_{m=1}^M p_m \underline{f}_m =: \underline{\mathcal{L}},$$
where (a) uses the facts that $r_{m,1}(\Theta_m, \Theta_0) = \frac{1}{2}\|\Theta_m - P_m\Theta_0\|^2 \geq 0$ and $r_{m,2}(w_m, \Pi_m, w_0) = \frac{1}{2}\|w_m - \Pi_m w_0\|^2 \geq 0$. 5) A9 directly follows from the constraint that $\Pi_m \in \mathcal{S}(d_{w,m}, d_{w,0})$,

Figure 9: Test accuracy of server model trained with HyFEM compared with FedProx for a) ModelNet40, b) Cifar-10, and c) EuroSAT datasets.

DATA PARTITIONING PATTERN In this subsection, we provide the data partitioning patterns for each setting. Notice that in Figure 11(b) and Figure 12(a), the black boxes with 0 inside indicate that the corresponding feature block of the samples in this class has not been used for training by any of the clients.

Examples of applications that generate heterogeneous data

Similar to VFL, we split the ML model into feature extractors and classifiers. Each feature extractor takes a feature block as input and outputs an intermediate feature; the classifier takes the concatenated intermediate features from multiple feature extractors as input and outputs the prediction. As illustrated in Figure 3b, on client $m$, the feature extractor $h_{i_m}(\theta_{m,i_m}; \cdot)$ for input feature block $\mathcal{D}_{i_m}$ is parameterized by $\theta_{m,i_m}$ for all $i_m \in I_m$. The feature extractors can have different neural network architectures (e.g., a CNN for CT/MRI images, an LSTM/Transformer for medical records, and a 1-D CNN for ECG data). We denote the concatenated feature extractors and their parameters as:
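As a concrete illustration of this extractor-plus-classifier composition, here is a minimal sketch of one client's forward pass. The block sizes, the plain linear-ReLU extractors, and the single linear classifier head are hypothetical choices for illustration, not the architectures used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical client m: owns feature blocks I_m = {0, 2} with raw widths 8, 6.
feature_blocks = {0: 8, 2: 6}
embed_dim, n_classes = 4, 3

# theta_{m,i}: one small linear extractor per owned block.
theta = {i: rng.normal(size=(embed_dim, d)) for i, d in feature_blocks.items()}
# w_m: classifier over the stacked intermediate features.
w = rng.normal(size=(n_classes, embed_dim * len(feature_blocks)))

def client_forward(x_blocks):
    # H_m(Theta_m; x): concatenate the extractor outputs in block order.
    v = np.concatenate([relu(theta[i] @ x_blocks[i]) for i in sorted(theta)])
    # F_m(w_m; v): classifier head on the stacked features.
    return w @ v

x = {i: rng.normal(size=d) for i, d in feature_blocks.items()}
logits = client_forward(x)
assert logits.shape == (n_classes,)
```

In the full method, each `theta[i]` would be replaced by a block-appropriate network (CNN, Transformer, etc.), and the classifier head is the $L$-layer $f_m(w_m;\cdot)$ matched to the server as described in Appendix A.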

Experiment settings for each dataset. $d_0, d_m$ denote the number of feature blocks; $N_0, N_m$ denote the number of samples; $M$ denotes the number of clients.

The data assignment pattern for the MultiView40 dataset. Note that 6.88% of the data is never used.

The data assignment pattern for the Cifar-10 and EuroSAT datasets.


[Table residue: class-index columns 1-40 of the data assignment pattern.] The result is shown in Figure 13. The server model trained with HyFEM has performance comparable to the centrally trained model. The clients' models perform worse on average than the full model, due to the lack of full features and classes, but their accuracy is 20% higher than that of the models obtained with stand-alone training.

