DATALESS KNOWLEDGE FUSION BY MERGING WEIGHTS OF LANGUAGE MODELS

Abstract

Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that both performs well across all data set domains and generalizes to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.

1. INTRODUCTION

The dominant paradigm for solving NLP tasks ranging from classification to sequence tagging involves fine-tuning a pretrained language model (PLM) using task-specific labeled data (Devlin et al., 2019; He et al., 2021). This results in specialized models that are explicitly trained to run inference over a single domain and task. Multi-task learning has shown that leveraging information across domains or tasks can be beneficial if the data sets, data set size and algorithms are well selected (Phang et al., 2018; Pruksachatkun et al., 2020; Poth et al., 2021; Weller et al., 2022). Combining knowledge of multiple data sets in a single model can lead to better overall performance on in-domain data (Poth et al., 2021), can generalize better on out-of-domain data (Wang et al., 2020b), and results in a model that is more practical and parameter-efficient than maintaining specialized models.

However, the multi-task learning setup suffers from two practical limitations. First, the training process requires access to the original labeled data, which may not be realistic: annotated data may be kept private by the agent fine-tuning the model, whether to ensure data or annotation privacy or to guard intellectual property in the annotations. Second, because a significant number of data or task combinations are not beneficial to performance (Poth et al., 2021), building a single model requires training on all data set combinations to identify the optimal one, which can be prohibitive, especially if there are many available source data sets or models.

Model merging is defined as combining multiple models into a single one in parameter space without access to data (Matena & Raffel, 2021). This technique provides an alternative to building a single model while satisfying data privacy constraints. 
Weight merging algorithms usually also have a closed-form solution, making them very efficient as no retraining is necessary, thus enabling usage even when a large number of data sets or model combinations are available. Merging can be considered as an alternative to model ensembling (Opitz & Maclin, 1999; Rokach, 2010) , where the outputs of individual models are combined to produce the final prediction. Model merging algorithms are a key step in federated learning (McMahan et al., 2017; Lin et al., 2022) , where multiple agents train their own model using private data and share only model updates with other models. However, in federated learning, model merging happens in multiple rounds of updates, after which the merged model is broadcast to all agents before the next round of training with private data. This dataless model merging is thus an extreme case of federated learning, where a single round of synchronization is admissible. Figure 1 provides an overview of the various related setups. We thus aim to use model merging to build a single model that can be used for inference on multiple domains or tasks and can generalize to new domains, in line with Wang et al. (2020b) . In contrast, simple averaging of weights for model merging was used by existing works such as Wortsman et al. (2022) to improve the performance of a specific model, where weight averaging was done over models fine-tuned using the same data set with different hyperparameters. Separately, Matena & Raffel (2021) focus on improving performance over a single target task by leveraging models trained on other donor tasks by merging models using Fisher-weighted averaging. This paper focuses on merging fine-tuned models that originate from pre-trained language models with the same architecture and pretrained weights. We introduce a novel model merging method named Regression Mean (RegMean), which is computationally efficient and extendable to merging any number of models. 
The method is inspired by the optimal solution for linear models that minimizes ℓ 2 distance between merged and individual models and has a closed form solution. We evaluate model merging algorithms in setups that range in complexity and type of fused knowledge. The experimental results across multiple model types (e.g. RoBERTa, T5, DeBERTa) show that our proposed method consistently and significantly outperforms other model merging and ensembling baselines and achieves higher generalization performance than the best individual models on out-of-domain data sets across several data collections. Our contributions are three-fold: (1) A novel model merging algorithm (Regression Mean); (2) an evaluation protocol for model merging algorithms that tests both in-domain and out-of-domain generalization ability; (3) analysis of computation and parameter efficiency across setups.

2. DATALESS MODEL MERGING FOR KNOWLEDGE FUSION

We consider a problem formulation with two main roles: (1) the agents (e.g., individuals or organizations) that train and release models; (2) the developers who aim to build a single model by fusing the knowledge of multiple available models. Each agent $i \in \{1..N\}$ fine-tunes a language model (LM) $f_i$ with pre-trained weights $\theta_{LM}$ on its private labeled dataset $D_i = \langle X_i, Y_i \rangle$ to obtain fine-tuned model weights $\theta_i$, where $X_i \in \mathbb{R}^{N_i \times *}$ are inputs, $Y_i \in \mathbb{R}^{N_i \times *}$ are labels, and $N_i$ is the number of annotated examples. The agents keep the labeled data set $D_i$ private. In addition to the fine-tuned model $f_i(\cdot; \theta_i)$, the agents can optionally disseminate certain statistics $S_i$, as long as these do not leak information about the labeled data set $D_i$. In turn, the developers use the fine-tuned models $f_i(\cdot; \theta_i)$ and statistics $S_i$ as inputs to a merging function $g$. The merging function is applied to a subset of fine-tuned models $\mathcal{K} \subseteq \{1..N\}$ (of size $K = |\mathcal{K}|$) to obtain parameters $\theta_{M_\mathcal{K}}$ of a merged model $f_{M_\mathcal{K}}$, where $\theta_{M_\mathcal{K}} = g(\theta_\mathcal{K}, S_\mathcal{K})$. In general, we expect the function $g$ to be computationally efficient and to produce $\theta_{M_\mathcal{K}}$ with a closed-form formulation. 

[Figure: illustration of model merging for linear layers with Simple averaging, Fisher-weighted averaging, and RegMean.] Fisher and RegMean require Fisher Information matrices or inner product matrices of layer inputs, respectively, but neither of them requires training data. For linear models, RegMean produces optimal weights that minimize the $\ell_2$ distance to individual model predictions on the corresponding training sets.

3. REGRESSION MEAN FOR MODEL MERGING

The key role in the model merging setup is played by the merging function $g$. We start by briefly introducing existing techniques for model merging, followed by the basic intuition for our proposed method, which we then extend to transformer-based language models. The underlying assumption is that the model architecture is the same for all models $f_i$, allowing for element-wise operations if needed and resulting in a merged model $f_{M_\mathcal{K}}$ of the same architecture and size as any individual model. We also assume models are fine-tuned from the same pretrained LM checkpoint. The study of methods that relax this constraint is outside the scope of this paper and is left for future work.

3.1. PRELIMINARIES

Simple Averaging (Simple) computes the merged weights as the element-wise arithmetic mean of the weights of all individual models:

$$\theta_{M_\mathcal{K}} = \frac{1}{K} \sum_{i \in \mathcal{K}} \theta_i.$$

This technique has proven effective when merging model weights that are already similar or in a similar space, such as checkpoints generated after each epoch in a training process (Wortsman et al., 2022). We expect simple averaging to under-perform when model weights live in a different space and are substantially different from each other, such as when merging models trained with different data, or when merging models only after the entire fine-tuning process, as opposed to synchronizing models after rounds as in the federated learning setup.

Fisher-Weighted Averaging (Fisher) aims to address the limitation that simple averaging ignores the potentially different importance of weights. The method relies on computing a per-weight importance $F_i$ for each individual model $i$, and reweighting the weights with this importance factor during merging as follows:

$$\theta_{M_\mathcal{K}} = \frac{\sum_{i \in \mathcal{K}} F_i \theta_i}{\sum_{i \in \mathcal{K}} F_i}.$$

Here, $F_i$ is the diagonal of the Fisher Information matrix, where

$$F_i = \mathbb{E}_{x \sim D_i} \, \mathbb{E}_{y \sim p_{\theta_i}(y \mid x)} \left( \nabla_{\theta_i} \log p_{\theta_i}(y \mid x) \right)^2.$$

Intuitively, $F_i$ measures the average gradient norm of the log-likelihood of each label with respect to the parameters; parameters with high average norms are considered important.
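As a minimal sketch of the two baselines above, assuming that weights and diagonal Fisher estimates are available as NumPy arrays of identical shape: the function names, the toy values, and the small `eps` term are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def simple_average(weights):
    """Element-wise arithmetic mean of same-shaped weight arrays."""
    return np.mean(np.stack(weights), axis=0)

def fisher_average(weights, fishers, eps=1e-12):
    """Element-wise Fisher-weighted mean: sum_i F_i * theta_i / sum_i F_i.

    `eps` is an assumption added here to avoid division by zero when all
    Fisher estimates for a parameter are zero; it is not in the formula.
    """
    numerator = sum(f * w for f, w in zip(fishers, weights))
    denominator = sum(fishers) + eps
    return numerator / denominator

# Toy parameters and importances (hypothetical values).
theta_1 = np.array([1.0, 0.0, 2.0])
theta_2 = np.array([3.0, 4.0, 2.0])
F_1 = np.array([1.0, 0.0, 1.0])
F_2 = np.array([1.0, 2.0, 3.0])

simple = simple_average([theta_1, theta_2])
fisher = fisher_average([theta_1, theta_2], [F_1, F_2])
```

In this toy case the second parameter has zero importance in model 1, so the Fisher-weighted result takes it entirely from model 2, while simple averaging splits the difference.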

3.2. MERGING LINEAR MODELS

Next, we recast the problem of model merging as a straightforward optimization problem. We start by inferring the optimal solution of merging two linear regression models trained on different data distributions and analyze its relationship to Simple averaging.

Consider two linear models $f_1(x) = W_1^T x$ and $f_2(x) = W_2^T x$, where $x \in \mathbb{R}^m$ and $W_1, W_2 \in \mathbb{R}^{m \times n}$, that are trained on two different annotated datasets $\langle X_1, y_1 \rangle$, $\langle X_2, y_2 \rangle$, where $X_1 \in \mathbb{R}^{N_1 \times m}$ and $X_2 \in \mathbb{R}^{N_2 \times m}$. Each row in $X_i$ corresponds to a training example. Our goal is to obtain a single merged model $f_M(x) = W_M^T x$ with outputs similar to $f_1$ on $X_1$ and $f_2$ on $X_2$. With $\ell_2$ distance as the metric, the optimization problem can be formulated as:

$$\min_W \; \|W^T X_1 - W_1^T X_1\|^2 + \|W^T X_2 - W_2^T X_2\|^2. \tag{1}$$

Here, with a slight abuse of notation, $W^T X_i$ denotes the model outputs on all rows of $X_i$. Eq. 1 describes a linear regression problem, where the inputs are $[X_1; X_2]$ (row concatenation of $X_1$ and $X_2$) and the targets are $[W_1^T X_1; W_2^T X_2]$, which has the closed-form solution $W_M = (X_1^T X_1 + X_2^T X_2)^{-1} (X_1^T X_1 W_1 + X_2^T X_2 W_2)$. The algorithm extends to merging $K$ models $W_i$, $i \in \mathcal{K}$, with little modification to the optimization problem in Eq. 1:

$$W_M = \Big( \sum_{i \in \mathcal{K}} X_i^T X_i \Big)^{-1} \sum_{i \in \mathcal{K}} \big( X_i^T X_i W_i \big). \tag{2}$$

We refer to Eq. 2 as Regression Mean (RegMean). To summarize, to merge a linear model $f_i$ with other models, we pre-compute the inner product matrix of its training data, $X_i^T X_i$; we do not recompute $X_i^T X_i$ when merging with different models. The merger retrieves the weights and the inner product matrices of inputs of the individual models and computes the merged weights as in Eq. 2.

Interpretation. RegMean can also be interpreted as reweighting and linearly combining rows in weight matrices, where the diagonal items of $X_i^T X_i$ mainly reweight the rows, while non-diagonal items linearly combine them. In the extreme case when $X_i^T X_i$ is diagonal, RegMean simply reweights the rows in $W_i$ by the importance of neurons. Moreover, when all $X_i^T X_i$ (or all $X_i$) are the same, Eq. 2 reduces to simple averaging, i.e., $W_M = \frac{1}{K} \sum_{i \in \mathcal{K}} W_i$.
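The closed form in Eq. 2 can be checked numerically. The sketch below is illustrative only: it assumes rows of each $X_i$ are examples, so that model outputs are computed as `X @ W`, and it uses randomly generated toy data. The name `regmean_linear` is hypothetical. It compares the RegMean solution with an ordinary least-squares fit on the stacked inputs and per-model predictions, and confirms the reduction to simple averaging when the inner product matrices coincide.

```python
import numpy as np

def regmean_linear(weights, grams):
    """W_M = (sum_i G_i)^{-1} sum_i G_i W_i, where G_i = X_i^T X_i."""
    lhs = sum(grams)
    rhs = sum(g @ w for g, w in zip(grams, weights))
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
m, n = 4, 3
X1 = rng.normal(size=(50, m))
# Second dataset with a different feature scale per dimension.
X2 = rng.normal(size=(80, m)) * np.array([1.0, 2.0, 0.5, 3.0])
W1 = rng.normal(size=(m, n))
W2 = rng.normal(size=(m, n))

W_M = regmean_linear([W1, W2], [X1.T @ X1, X2.T @ X2])

# Reference: least squares on row-concatenated inputs and predictions.
X = np.vstack([X1, X2])
Y = np.vstack([X1 @ W1, X2 @ W2])
W_ref, *_ = np.linalg.lstsq(X, Y, rcond=None)

# With identical inner product matrices, RegMean equals the plain mean.
G = X1.T @ X1
W_same = regmean_linear([W1, W2], [G, G])
```

Note that only the inner product matrices and the weights enter `regmean_linear`; the raw inputs are used here solely to build the reference solution.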

3.3. REGMEAN FOR TRANSFORMER LANGUAGE MODELS

Transformer models consist of feed-forward layers and attention heads, in which linear layers are important components. We apply RegMean independently to every linear layer. We record $X_i^{(j)T} X_i^{(j)}$ for each linear layer $f^{(j)}$, where $X_i^{(j)}$ denotes the input features of the linear layer. The other types of weights, such as embeddings and bias terms, which represent a small portion of the overall parameter set, are merged using simple averaging.

Reducing Non-Diagonal Items of Inner Product Matrices. We empirically find that directly applying Eq. 2 for merging yields degenerate models for some pre-trained LM architectures. We therefore decrease the non-diagonal items of the inner product matrices by multiplying them by a scalar $\alpha$ (set to 0.9 in most cases). This also corresponds to adding a regularization term to the optimization objective in Eq. 1 that penalizes the Euclidean distance between the merged weights $W_M$ and the individual model weights $W_{1..K}$. We include a formal derivation and proof in Appendix A. We illustrate RegMean in Figure 2 and summarize the complete RegMean method in Algorithm 1.
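The per-layer procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: layer inputs are given as a list of NumPy batches rather than captured from a real transformer, weights are stored as `[input_dim, output_dim]` arrays, and all function names are hypothetical. The last lines check the correspondence stated in the text between scaling non-diagonal items by $\alpha$ and adding a regularizer $\gamma\,\mathrm{diag}(X^T X)$ with $\alpha = 1/(1+\gamma)$.

```python
import numpy as np

def accumulate_gram(batches):
    """Sum X^T X over batches of layer inputs, each of shape [batch, d]."""
    d = batches[0].shape[1]
    gram = np.zeros((d, d))
    for x in batches:
        gram += x.T @ x
    return gram

def scale_off_diagonal(gram, alpha):
    """alpha * G + (1 - alpha) * diag(G): the diagonal is left unchanged."""
    return alpha * gram + (1.0 - alpha) * np.diag(np.diag(gram))

def regmean_layer(weights, grams, alpha=0.9):
    """RegMean for one linear layer with non-diagonal scaling."""
    scaled = [scale_off_diagonal(g, alpha) for g in grams]
    rhs = sum(g @ w for g, w in zip(scaled, weights))
    return np.linalg.solve(sum(scaled), rhs)

rng = np.random.default_rng(0)
d, out = 5, 2
inputs_1 = [rng.normal(size=(8, d)) for _ in range(3)]
inputs_2 = [rng.normal(size=(8, d)) + 1.0 for _ in range(3)]
G1, G2 = accumulate_gram(inputs_1), accumulate_gram(inputs_2)
W1, W2 = rng.normal(size=(d, out)), rng.normal(size=(d, out))

W_alpha = regmean_layer([W1, W2], [G1, G2], alpha=0.9)

# Regularized view: G_i + gamma * diag(G_i) with alpha = 1 / (1 + gamma).
gamma = 1.0 / 0.9 - 1.0
reg = [g + gamma * np.diag(np.diag(g)) for g in (G1, G2)]
W_gamma = np.linalg.solve(
    sum(reg), sum(g @ w for g, w in zip(reg, (W1, W2)))
)
```

The two solutions agree because the regularized matrices differ from the $\alpha$-scaled ones only by the common factor $1 + \gamma$, which cancels between the inverse and the weighted sum.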

Algorithm 1: RegMean for Transformer Language Models

Data: Individual models $f_{1..K}$; number of linear layers $J$; inner product matrices $G_i^{(j)} = X_i^{(j)T} X_i^{(j)}$ for all linear layers $1 \le j \le J$ and models $1 \le i \le K$; scaling factor of non-diagonal items $\alpha$
Result: Merged model $f_M$
for $j$ in $1, 2, ..., J$ do
    $W_1^{(j)}, W_2^{(j)}, ..., W_K^{(j)} \leftarrow$ getLinearWeights($f_{1..K}$, $j$);
    Reduce non-diagonal items of inner product matrices $G_i^{(j)}$ as $\tilde{G}_i^{(j)} \leftarrow \alpha G_i^{(j)} + (1 - \alpha)\,\mathrm{diag}(G_i^{(j)})$;
    $W_M^{(j)} \leftarrow \big( \sum_{i \in \mathcal{K}} \tilde{G}_i^{(j)} \big)^{-1} \sum_{i \in \mathcal{K}} \big( \tilde{G}_i^{(j)} W_i^{(j)} \big)$ and set the weight as $W_M^{(j)}$ in $f_M$
end
Average weights as $W_M = \frac{1}{K} \sum_{i \in \mathcal{K}} W_i$ for weights other than linear layer weights in $f_M$

3.4. PROPERTIES OF REGMEAN

Computational Efficiency. Inner product matrices of all linear layer inputs can be computed within a single forward pass over the training data after individual models are trained. This is more efficient than computing Fisher Information matrices, which requires an additional backward pass to compute gradients.

Memory Overhead. The memory overhead of inner product matrices is $\sum_{j=1}^{J} d_j^2$, where $J$ is the number of linear layers in the model and $d_j$ is the input dimension of the $j$-th linear layer. For transformer models, this overhead is comparable to the number of parameters and to the size of Fisher Information matrices.

Data Privacy. It should be noted that RegMean never requires training data $X_i$ when merging; instead, it only requires low-dimensional inner product matrices. The agents that release the models can share the matrices without sharing the private training data and their labels.

[...] merged model $f_M$ to achieve competitive test performance across all datasets $D_{1..N}$. This model is useful, for example, when the test distribution is a mixture of $D_{1..N}$. In addition, a single model has the advantage of being able to run inference across multiple domains when the user of the model provides data from one of the domains but is not aware of the domain label (Wang et al., 2020b). In our case, $D_{1..N}$ can represent different non-i.i.d. partitions of the same dataset, different domains for the same task, or different tasks altogether. Second, we expect the merged model to achieve higher out-of-domain (OOD) generalization ability. Formally, we evaluate the performance of the merged model $f_M$ over the out-of-domain test sets $D^o_{1..N_o}$, where the data distributions are different from any of $D_{1..N}$.

Datasets. We use the GLUE datasets (Wang et al., 2018) for studying merging models trained on non-i.i.d. partitions and merging models trained for different tasks. We use emotion classification and named entity recognition (NER) as base tasks for studying merging models trained on different domains of the same task. For emotion classification, we use the collection of preprocessed datasets from Oberländer & Klinger (2018). We choose 5 high-resource datasets for training individual models and 5 low-resource datasets for evaluation of out-of-domain generalization ability. For NER tasks, we use 6 domains in OntoNotes (Hovy et al., 2006) for training individual models, and use CoNLL (Sang & De Meulder, 2003) and Twitter NER (Rijhwani & Preotiuc-Pietro, 2020) to measure out-of-domain generalization performance. We include details of the datasets in Appendix B.

Metrics. In the case of merging models trained on non-i.i.d. partitions of the same dataset, we evaluate the merged models over a single test set with a joint distribution of all partitions. For merging models trained on different domains or tasks, we measure the performance over all single domains or tasks incorporated into merging and take their macro-average. For out-of-domain evaluation, we similarly take the macro-average of the performance over the out-of-domain test sets.

4.2. COMPARED METHODS

Model Merging. For model merging algorithms, we compare the performance of RegMean with the previously introduced methods of simple averaging (Simple) (Wortsman et al., 2022) and Fisher-weighted averaging (Fisher) (Matena & Raffel, 2021).

Model Ensembling. Model ensembling represents an alternative to model merging when access to the original data is not available. We thus build an ensemble model (Ensemble) by obtaining all logits from the individual model predictions and averaging them before taking the argmax.

Individual Models. To provide context for the benefits of merging, we report the performance of the individual models involved in merging. We thus report: (1) the average performance of all individual models (Avg. $f_{1..N}$); (2) the performance of the best single individual model (Best $f_{1..N}$), as determined using the validation set; (3) the performance of the individual models corresponding to the training data set for each test set (Domain-Specific).

Multi-task Learning (MTL). We also consider MTL, which trains a single model over the joint training data sets $D_{1..N}$. We note that the multi-task method should represent an upper bound for model merging, as multi-task learning has access to the original labeled data, which it can leverage to train a better model compared to dataless approaches such as model merging. Depending on the data sets, the task can be the same (e.g., emotion prediction) or different (e.g., GLUE tasks).
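The ensembling baseline described above (averaging logits, then taking the argmax) can be sketched as follows. This is a minimal illustration that assumes all individual models share the same label space and that their logits are already available as arrays of shape `[num_examples, num_classes]`; the function name and the toy logits are hypothetical.

```python
import numpy as np

def ensemble_predict(logits_per_model):
    """Average logits over models, then take the argmax class per example."""
    mean_logits = np.mean(np.stack(logits_per_model), axis=0)
    return mean_logits.argmax(axis=-1)

# Two models, two examples, three classes (toy values).
logits_a = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.5]])
logits_b = np.array([[0.0, 3.0, 0.0], [0.0, 0.2, 0.9]])

preds = ensemble_predict([logits_a, logits_b])
```

Unlike a merged model, this baseline still has to run every individual model at inference time, so its cost grows with the number of models.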

4.3. EXPERIMENT DETAILS

Pre-trained Models. We initialize all models $f_i$ using the same architecture and the same pre-trained model weights $\theta_{LM}$. We experiment with multiple pre-trained models as starting points for merging. We experiment with both encoder-only models, including the classic RoBERTa-base (Liu et al., 2019) and state-of-the-art models like DeBERTa-large-v3 (He et al., 2021), and with encoder-decoder models represented by T5-base-v1.1 (Raffel et al., 2020). We note that T5-base-v1.1 is not applicable to sequence labelling tasks represented by our NER experiments. Further training details are in Appendix B.

Model Initialization. It has been shown that model merging is more successful when individual models share the same weight initialization (McMahan et al., 2017). In this paper, we focus on merging fine-tuned language models of the same architecture that are initialized from the same pretrained model weights $\theta_{LM}$ before fine-tuning. For new classification heads, we present the results of both shared initialization (Same Head Init, SH) and different initialization (Diff Head Init, DH), as our proposed method is amenable to both. This does not apply to T5, where we fine-tune the pretrained LM head for prediction.

Hyperparameters. We set the non-diagonal multiplier $\alpha$ in RegMean to 0.9, with the exception of T5-base models, where it is 0.1. We compute inner product matrices with at most 1,000 training batches. Sensitivity analysis of hyperparameters is presented in Section 5.3 and Appendix C.

[Figure caption fragment: Positive values indicate performance improvement after merging. The boxplots summarize results over 10 ($C_5^2$) or 15 ($C_6^2$) combinations of 5 or 6 domain-specific models in Emotion and NER. The triangles denote the mean. Note that y-axes are not in the same scale.]

5. RESULTS

The main goal of our experiments is to benchmark the performance of different dataless model merging methods and compare these with individual model performance before merging. In addition, we aim to situate these methods in the context of other methods that represent upper bounds because they have access to more information (i.e., data for fine-tuning) than model merging. Our experiments examine knowledge fusion from two perspectives: (1) in-domain performance over test data sets similar to those over which individual models are trained, and (2) out-of-domain generalization performance over data sets from held-out domains or tasks. We study performance dynamics in scenarios of increasing difficulty. First, we study a simple scenario where merging is performed on models trained on non-i.i.d. partitions of the same data set. Next, we study merging of models trained on different domains of the same task, and lastly merging of models trained on different tasks.

Merging Models Trained on Non-i.i.d. Partitions. We start with a setup in which we merge models trained on non-i.i.d. partitions of the same data set, which is simulated using synthetic data splits over the 8 tasks in the GLUE benchmark. For each task, we split the training data into two partitions with 1,000 training examples each and with different label distributions (details in Appendix B). We then fine-tune 8 pairs of individual models over the two partitions and merge each pair of models. The merged models are evaluated on the official validation sets (i.e., with a joint distribution of both partitions). In Table 1, we find that model merging consistently improves over the average performance of individual models across the 8 tasks. This verifies that weight merging allows combining knowledge from individual models and can lead to a more powerful single model. We further note that RegMean outperforms simple averaging and is similar in performance to Fisher-weighted averaging. This is a proof of concept that model merging and RegMean work in a simple scenario.

5.1. MODEL MERGING FOR FUSING IN-DOMAIN KNOWLEDGE

Merging Models Trained on Different Domains. We next shift to a more challenging setup where individual models are trained on data from different domains of the same task.

Pairwise Merging. We start by merging pairs of models trained on different domains. For emotion classification and NER, we have 10 ($C_5^2$) and 15 ($C_6^2$) combinations of domain-specific models, respectively. The boxplots in Fig. 3 summarize the relative performance drop compared to domain-specific models, computed as

$$\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \frac{M(f_{M_{i,j}}, D_i) - M(f_i, D_i)}{M(f_i, D_i)},$$

where $M(f, D)$ denotes the metric score obtained by evaluating $f$ on the test set of $D$, and $f_{M_{i,j}}$ is the model obtained by merging $f_i$ and $f_j$; negative values correspond to a drop. The performance drop is reasonable, as the merged model can run inference on both domains; when the test set is a mixture of all domains, the merged model usually outperforms single individual models, as we will see in the next paragraph. We see clear differences between model merging algorithms, where RegMean performs the best. On RoBERTa-base and DeBERTa-large, RegMean reduces the performance drop on Emotion from 55% to 12% and from 85% to 15%, respectively, compared to simple averaging.

Merging All Domain-Specific Models. We further experiment with merging all 5 or 6 domain-specific models on Emotion Classification and NER. Table 2 summarizes the results. The results show that merging all models is a challenging setup. The large differences between the average and the best performance of individual models (Avg. $f_{1..N}$ and Best $f_{1..N}$) indicate that the performance of individual models has a high variance. As a result, model ensembling suffers from poor individual models: the improvements are mostly marginal compared to Best $f_{1..N}$, while for DeBERTa-large on Emotion, the performance is actually lower. In contrast, MTL improves performance significantly over Best $f_{1..N}$ and achieves performance similar to or better than domain-specific models, which implies that a single model is capable of encoding knowledge of all domains in our setup. 
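The pairwise relative-change statistic used for the boxplots above can be computed as in the following sketch. The scores are made-up placeholders, not results from the paper, and the function name and the dictionary layout (`merged[(i, j)]` holding the score of the model merged from $f_i$ and $f_j$, evaluated on $D_i$) are assumptions made for illustration.

```python
def mean_relative_change(individual, merged):
    """Average of (M(f_Mij, D_i) - M(f_i, D_i)) / M(f_i, D_i) over i != j.

    individual[i] is the score of domain-specific model i on its own test
    set; merged[(i, j)] is the score of the pairwise merged model on D_i.
    """
    n = len(individual)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (merged[(i, j)] - individual[i]) / individual[i]
    return total / (n * (n - 1))

# Placeholder scores for two domains (hypothetical values).
individual = [0.80, 0.50]
merged = {(0, 1): 0.72, (1, 0): 0.50}

change = mean_relative_change(individual, merged)
```

With these placeholder numbers the merged model loses 10% relative performance on the first domain and nothing on the second, giving an average change of about -0.05.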
We then compare three different merging algorithms. RegMean achieves the best in-domain performance on both Emotion and NER tasks, except for DeBERTa-large on Emotion, where Fisher performs slightly better. Simple averaging performs poorly (except for T5), especially on RoBERTa-base and DeBERTa-large in the emotion tasks. We note that Fisher clearly under-performs RegMean in our previous pairwise merging experiments; Fisher-weighted averaging may actually produce a merged model that is very similar to one of the individual models. RegMean also outperforms ensembling in all but one of the five scenarios. RegMean also clearly outperforms Best $f_{1..N}$ on RoBERTa and T5-base on Emotion, which makes model merging with RegMean useful for performance purposes, in addition to the practical convenience of deploying and maintaining a single model for multiple domains.

Merging Models Trained on Different Tasks. We also experiment with merging models trained on different tasks using DistilBERT-base and RoBERTa-base. We train individual models with the full training data of 8 GLUE tasks. We do not merge task-specific classification heads, as these can have different dimensions depending on the task and output space. We summarize the results in Figure 4. We again see a similar pattern when comparing model merging techniques, with RegMean clearly improving over Simple averaging and Fisher-weighted averaging.

Out-of-Domain Generalization when Merging all Domain-Specific Models. Table 3 summarizes OOD generalization performance when merging all domain-specific models. We see a similar pattern in OOD generalization performance, where RegMean in general performs the best across all model merging algorithms. The performance is lower than Fisher only on RoBERTa-base and DeBERTa-large with different head initialization. We also see that RegMean outperforms model ensembling, which is comparable in the amount of information it can use, in most cases. Further, on the emotion classification data sets, it is notable that RegMean achieves higher OOD performance than the Best $f_{1..N}$ on T5-base. We also found that knowledge fusion itself can negatively impact performance when there are poor individual models: on NER, all merging algorithms and even MTL do not achieve better OOD performance on CoNLL and Twitter than picking the Best $f_{1..N}$, as previously indicated in Wang et al. (2020b).

5.2. MODEL MERGING

Incrementally Merging a Subset of Models. In a scenario where the OOD performance of each individual model is known (e.g., when the validation sets of the OOD data sets are provided), we can mitigate the impact of having poor individual models by merging only a subset $\mathcal{K} \subseteq \{1..N\}$ of models. We apply a technique similar to Wortsman et al. (2022) and Ramé et al. (2022), which greedily identifies new individual models to merge. We use their OOD performance on the validation sets to incrementally add models and plot the results in Figure 6. In general, merging only a subset of models is better than merging all models; e.g., on RoBERTa-base with the same head initialization, RegMean outperforms Best $f_{1..N}$ by merging only two models.
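One plausible reading of the incremental procedure above is sketched below: rank the individual models by their OOD validation score, merge the top-$k$ models for increasing $k$, and record the score of each merged model. This is an assumption about the procedure, not a description of the exact algorithm used in the experiments. `merge_fn` and `eval_fn` are hypothetical callables; here they are replaced by toy stand-ins (plain averaging and negative distance to a target vector).

```python
import numpy as np

def incremental_merge(models, val_scores, merge_fn, eval_fn):
    """Merge the top-k models (ranked by validation score) for k = 1..N.

    Returns a list of (k, score of the merged model) pairs.
    """
    order = sorted(range(len(models)), key=lambda i: val_scores[i],
                   reverse=True)
    curve = []
    for k in range(1, len(order) + 1):
        subset = [models[i] for i in order[:k]]
        curve.append((k, eval_fn(merge_fn(subset))))
    return curve

# Toy stand-ins: "models" are weight vectors, quality is closeness to target.
target = np.array([1.0, 1.0])
models = [np.array([1.5, 1.0]), np.array([0.5, 1.0]), np.array([5.0, 5.0])]

def eval_fn(w):
    return -float(np.linalg.norm(w - target))

def merge_fn(ws):
    return np.mean(np.stack(ws), axis=0)

val_scores = [eval_fn(w) for w in models]
curve = incremental_merge(models, val_scores, merge_fn, eval_fn)
best_k, best_score = max(curve, key=lambda t: t[1])
```

In this toy example the third model is poor, so merging only the two best models gives a better score than merging all three, mirroring the observation that a subset can be preferable to the full set.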

5.3. DISCUSSION

Pre-trained Model Impact in Merging. Our results also show that the underlying pre-trained model is an important factor that affects the performance of merged models. Overall, merging T5-base models is successful even with simple averaging, while DeBERTa-large is hard to merge, which hints at an interaction between merge-ability and the pre-training objective. We believe a more comprehensive study of such factors is an interesting direction for future work.

Impact of Scaling Non-Diagonal Values in Inner Product Matrices. We noticed that when $\alpha = 1.0$ (i.e., no scaling), RegMean yields degenerate performance on T5-base and DeBERTa when merging two models, while slightly decreasing $\alpha$ to 0.9 eliminates the issue. In the other extreme case, when $\alpha = 0$, the inner product matrices become diagonal and RegMean simply reweights rows of weight matrices, making the method similar to Simple averaging. We plot the pairwise merging performance of RegMean with $0 \le \alpha \le 1$ in Figure 5a for T5-base and DeBERTa-large, as well as the performance of merging multiple T5 models in Figure 5b. We observe that the performance of RegMean is mostly stable between $\alpha = 0.1$ and $0.9$, but suddenly drops at $\alpha = 1.0$. When merging multiple T5-base models, both in-domain and OOD performance reach their maximum at $\alpha = 0.1$ and slowly drop as $\alpha$ increases, with OOD performance suffering a slightly larger drop.

Limitations. We note that the requirement of inner product matrices in RegMean (and Fisher Information in Fisher-weighted averaging) can be a limitation. To merge existing models released online [...]

[...] 2022) study domain generalization by averaging weights of models trained over the same datasets with different configurations. Matena & Raffel (2021) study merging using Fisher-weighted averaging with the aim of improving performance on a single target task by leveraging other 'donor' tasks. Choshen et al. (2022) show that fusing fine-tuned models with simple weight averaging creates a better starting point for fine-tuning on new tasks. Weight averaging was also used by Li et al. (2022) for building language models with multi-domain capabilities, where new domain 'experts' are initialized using weight averaging from the existing experts. Wang et al. (2022) use weight averaging to fuse knowledge learned when training multiple adapters, with the aim of obtaining better few-shot capabilities and increased model robustness. Merging updates of private models is a crucial intermediate step in federated learning (McMahan et al., 2017; Li et al., 2019). However, a key aspect of federated learning algorithms is that the joint model is iteratively updated in multiple rounds, which is not allowed for model merging. The success of the simple arithmetic mean for model merging has been explained from the perspective of loss landscapes and linear mode connectivity (Frankle et al., 2020; Neyshabur et al., 2020; Draxler et al., 2018; Ainsworth et al., 2022). Further, improved merging algorithms aim to match permutations between the weights of different models (Singh & Jaggi, 2020; Nguyen et al., 2021; Ainsworth et al., 2022; Wang et al., 2020a), which is a complementary line of effort to our work. We experiment with permutation matching algorithms and present our analysis in Appendix D.

Knowledge Fusing via Distillation. Recent work has used the knowledge distillation framework to fuse the capabilities of multiple teacher models by distilling them into a smaller student model at the fine-tuning or pre-training stage (Khanuja et al., 2021), albeit requiring full access to data for distillation. Dataless distillation, although for computer vision architectures and not using Transformer-based approaches, was attempted by Lopes et al. (2017) and Nayak et al. (2019). These methods have the additional disadvantage of not having a closed-form solution and are thus not computationally efficient.

7. CONCLUSIONS AND FUTURE WORK

This paper studied the problem of fusing knowledge of multiple fine-tuned language models by model merging without access to training data. We proposed a new method inspired by linear models named Regression Mean (RegMean). We introduced a series of experimental setups in which we demonstrated that our method outperforms other alternatives to dataless merging or ensembling. Further, in non-i.i.d. and out-of-domain experiments, we showed that model merging can outperform individually trained models. Merged models are also very practical, especially when compared to hosting multiple models, as the merging algorithm is very efficient, adds a minimal number of additional parameters and has a similar inference speed to any individual model. The implications of model merging are wide ranging from efficient intermediary-task selection to improve performance to combining models trained with private data in a federated learning setup. Future work can focus on merging models with different initialization or architectures, merging models sequentially at scale or merging pre-trained models before the fine-tuning stage.

A DERIVATION OF THE COMPLETE FORMULATION OF REGMEAN

Consider merging $K$ linear models. We formulate the optimization problem as

$$\min_{W} \sum_{i \in K} \lVert W^T X_i - W_i^T X_i \rVert^2 + \sum_{i \in K} (W - W_i)^T \Lambda_i (W - W_i),$$

where, for all $i$, $W, W_i \in \mathbb{R}^{m \times n}$, $X_i \in \mathbb{R}^{N_i \times m}$, and $\Lambda_i = \mathrm{diag}(\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{iK}) \succeq 0$. The second term is a regularization term that encourages $W$ to be close to $W_i$, where $\lambda_{ij}$ is the regularization strength for the $j$-th row of $W_i$. Here, $\lambda_{ij}$ can be set to any non-negative value. The optimal solution to this problem is

$$W_M = \Big[ \sum_{i \in K} (X_i^T X_i + \Lambda_i) \Big]^{-1} \sum_{i \in K} \big[ (X_i^T X_i + \Lambda_i) W_i \big].$$

Proof. We compute the gradient of the objective function (denoted $L$) w.r.t. the merged weight $W$:

$$\frac{\partial L}{\partial W} = \sum_{i \in K} \big( -2 X_i^T X_i W_i + 2 X_i^T X_i W \big) + \sum_{i \in K} \big( -2 \Lambda_i W_i + 2 \Lambda_i W \big).$$

We see that $L$ is convex w.r.t. $W$. Therefore, we may find the minimizer $W^*$ of $L$ by setting $\frac{\partial L}{\partial W} = 0$:

$$\sum_{i \in K} \big( X_i^T X_i W_i + \Lambda_i W_i \big) = \sum_{i \in K} \big( X_i^T X_i + \Lambda_i \big) W^* \qquad (6)$$

$$W^* = \Big[ \sum_{i \in K} (X_i^T X_i + \Lambda_i) \Big]^{-1} \sum_{i \in K} \big[ (X_i^T X_i + \Lambda_i) W_i \big].$$

Usually, in linear regression, the regularization strength $\Lambda_i$ is manually specified as a constant value. However, in our case, the scale of $X_i^T X_i$ may differ a lot across models, layers, or datasets. Therefore, we let $\Lambda_i$ scale with $X_i^T X_i$ and set $\Lambda_i = \gamma \, \mathrm{diag}(X_i^T X_i)$, where $\gamma$ is a fixed scalar, so that

$$W_M = \Big[ \sum_{i \in K} \big( X_i^T X_i + \gamma \, \mathrm{diag}(X_i^T X_i) \big) \Big]^{-1} \sum_{i \in K} \big[ \big( X_i^T X_i + \gamma \, \mathrm{diag}(X_i^T X_i) \big) W_i \big].$$

This formulation is equivalent to increasing the scale of the diagonal items of the inner product matrices $X_i^T X_i$. Decreasing all non-diagonal items of the inner product matrices by multiplying them by $\alpha = \frac{1}{1+\gamma}$ has the same effect, as we have done in Sec. 3.3, because the common factor $\frac{1}{1+\gamma}$ cancels between the inverse and the weighted sum:

$$W_M = \Big[ \sum_{i \in K} \Big( \frac{1}{1+\gamma} X_i^T X_i + \frac{\gamma}{1+\gamma} \mathrm{diag}(X_i^T X_i) \Big) \Big]^{-1} \sum_{i \in K} \Big[ \Big( \frac{1}{1+\gamma} X_i^T X_i + \frac{\gamma}{1+\gamma} \mathrm{diag}(X_i^T X_i) \Big) W_i \Big].$$

B DETAILS FOR DATASETS, PREPROCESSING, METRICS, AND TRAINING

GLUE.
For GLUE (Wang et al., 2018) experiments, we use CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), STS-B (Cer et al., 2017), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), QQP, and RTE (Giampiccolo et al., 2007).

To study merging models trained on non-i.i.d. partitions, we construct two partitions for each of the GLUE tasks. We first randomly sample a "key class" from the task, draw 80% of the data of that class from the training set, and put it into one partition. The rest of the data constitutes the other partition. We then uniformly move examples that do not belong to the "key class" from one partition to the other so that the two partitions have the same number of examples. Finally, we uniformly sub-sample each partition so that each has 1,000 training examples.

Emotion. For emotion classification, we use the datasets preprocessed by Oberländer & Klinger (2018). We use DailyDialogs (Li et al., 2017), CrowdFlower, TEC (Mohammad, 2012), Tales-Emotion (Alm et al., 2005), and ISEAR (Scherer & Wallbott, 1994) for training domain-specific models. We use Emoint (Mohammad & Bravo-Marquez, 2017), SSEC (Schuff et al., 2017), ElectoralTweets (Mohammad et al., 2015), GroundedEmotions (Liu et al., 2017), and AffectiveText (Strapparava & Mihalcea, 2007) as held-out datasets for evaluating out-of-domain generalization. All the selected datasets have the classes anger, disgust, fear, joy, sadness, and surprise in their label space, while some of them have more classes (e.g. guilt). For the in-domain performance on each dataset, we compute Macro-F1 over all classes that are present in the dataset. For out-of-domain performance, we compute Macro-F1 only over anger, disgust, fear, joy, sadness, and surprise. In some of the datasets, inputs may be associated with multiple emotion labels. We therefore formulate emotion classification as a multi-label classification task for all datasets.
On RoBERTa and DeBERTa, we create a binary classification head for each class. When merging the weights of classification heads, we exclude the heads that are not learned in the training process; e.g., if one dataset has the class "guilt" but the other does not, the weights of the "guilt" classification head of the other model will not be used for merging. For T5, we reformulate the task into a sequence-to-sequence format with the template: does the sentence express {class name}? {sentence}. with possible outputs yes or no. One such example is created for each class that is present in the dataset. During evaluation, we treat an exact match of yes as a prediction of the positive label, and treat any other output as a prediction of the negative label.

NER. We use 6 domains (newswire, broadcast news, broadcast conversation, magazine, telephone conversation and web data) in OntoNotes (Hovy et al., 2006).

Implementation. We use Hugging Face's transformers library (Wolf et al., 2019) to download pretrained LM checkpoints and fine-tune the models. We specifically note that we use the forward function hook feature in PyTorch (Paszke et al., 2019) to obtain the inputs of all linear layers in order to compute inner product matrices. This makes the code implementation of RegMean agnostic to the model architecture.

Training Details. We fine-tune DistilBERT-base, RoBERTa-base, and DeBERTa-large with an initial learning rate of 1e-5, and fine-tune T5-base with an initial learning rate of 1e-4. We use the AdamW optimizer throughout the experiments. The learning rate gradually warms up over the first 6% of training steps and then linearly decays to 0. We train models with a batch size of 16 for 10 epochs on GLUE, 30 epochs on emotion classification and 20 epochs on NER. We evaluate the performance of the model after each epoch and restore the best-performing checkpoint at the end of training.
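To illustrate how the inner product matrices mentioned in the Implementation paragraph feed the closed form of Appendix A, the following is a minimal NumPy sketch for a single linear layer. It is not the released code: the function names `gram`, `scale_off_diagonal`, and `regmean` are hypothetical, the inputs of model i are assumed to be stacked as rows of an array of shape (N_i, m) so that the inner product matrix is X_i^T X_i, and `alpha` plays the role of the scale α applied to non-diagonal items in Sec. 3.3.

```python
import numpy as np


def gram(x):
    """Inner product matrix X^T X for layer inputs x of shape (N_i, m)."""
    return x.T @ x


def scale_off_diagonal(g, alpha):
    """Multiply the non-diagonal items of g by alpha; the diagonal is kept."""
    diagonal = np.diag(np.diag(g))
    return alpha * g + (1.0 - alpha) * diagonal


def regmean(weights, grams, alpha=1.0):
    """Merge linear-layer weights of shape (m, n) with the closed form
    W_M = [sum_i G_i]^{-1} sum_i G_i W_i, where G_i is the inner product
    matrix of model i after scaling its non-diagonal items by alpha."""
    scaled = [scale_off_diagonal(g, alpha) for g in grams]
    lhs = sum(scaled)
    rhs = sum(g @ w for g, w in zip(scaled, weights))
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(lhs, rhs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = [rng.normal(size=(64, 4)) for _ in range(2)]   # toy layer inputs
    weights = [rng.normal(size=(4, 3)) for _ in range(2)]   # toy layer weights
    merged = regmean(weights, [gram(x) for x in inputs], alpha=0.9)
    print(merged.shape)
```

In this sketch, when all models share the same inner product matrix the result reduces to the simple average of the weights, and setting `alpha = 1 / (1 + gamma)` gives the same merged weights as adding `gamma * diag(X_i^T X_i)` to each inner product matrix.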

C SENSITIVITY ANALYSIS

Number of batches for computing inner product matrices. In our main experiments, we use N = 1,000 batches (of size 16) for computing inner product matrices. We present additional analysis of the effect of N and summarize the results in Table 6. In general, performance improves as we increase N, but it saturates around N = 100.

Alternative methods for regularization. As mentioned in Sec. 3.3 and Appendix A, we reduce the non-diagonal items of inner product matrices by a fixed scale α, which has the regularization effect of encouraging merged weights to be closer to individual model weights. Here we analyze an alternative regularization method, which adds a fixed scalar β to the diagonal items instead of relatively scaling the non-diagonal items. We experiment with emotion classification on T5, where regularization seems to be most necessary. We merge each pair of models on 5 emotion classification datasets and report the average performance over all pairs (a setting similar to Figure 3) in Table 8. We see that relative scaling achieves clearly better performance than adding a constant to the diagonals. As mentioned in Appendix A, this may be caused by differences in the scale of inputs across layers, models, and datasets, which make it difficult to find a single additive regularizer.

Regularization method                 In-domain F1
Adding a constant to diagonals
  β = 0.01                            28.24
  β = 0.1                             33.74
  β = 0.2                             39.13
  β = 0.5                             34.70
Relative scaling of non-diagonals
  α = 0.1                             40.32

Assuming no permutations in weights, we should expect the diagonal items of M (the distances between weight vectors in corresponding positions) to be much smaller than the non-diagonal items. Otherwise, we may obtain non-trivial permutations by solving an optimal transport problem with M. In Figure 7, we visualize the matrix M on the two-layer MLP after each transformer block, which is the only place where linear layers are stacked without residual connections in transformers, making weight permutations most likely to happen.
However, in Figure 7, we clearly see that the diagonal items of M are significantly smaller than the non-diagonal items. The results imply that there are no permutations in weights. In this case, the permutation matrix we obtain by solving optimal transport is a trivial identity matrix. We conjecture that sharing the same pretrained LM weight initialization contributes to stability in training, resulting in no permutations in weights. The residual connections in transformers may further prevent weights in other modules from getting permuted.

Activation-Based Matching. We apply activation-based matching from Git Re-Basin (Ainsworth et al., 2022). The algorithm relies on a similarity matrix $C \in \mathbb{R}^{n \times n}$ that measures the pairwise similarity of activations over N training examples in a certain layer. More formally, C is computed as $Z_A^T Z_B$, where $Z_A, Z_B \in \mathbb{R}^{N \times n}$ are the activations at a given layer in the models $f_A$ and $f_B$. The algorithm solves a linear assignment problem with C to obtain permutations in activations. Similarly, if there is no permutation, we expect the diagonal items of C to be large. We visualize the matrix C in Figure 8. In contrast to weight-based matching, C is far from being diagonal. This allows activation-based matching algorithms to produce non-trivial permutation matrices. However, when we apply these permutations, we obtain performance that is far below that of simple averaging without matching. We conjecture that in our setup permutations of activations do not faithfully represent permutations in weights. Though we only present empirical findings in this paper, we consider figuring out the reasons for this discrepancy to be interesting future work.
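The two matrices discussed in this analysis can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the OTFusion or Git Re-Basin implementation: the function names are hypothetical, weight vectors are taken to be the columns of weight matrices of shape (m, n), and `diagonal_is_best` replaces the optimal transport or linear assignment solver with a simpler row-wise check. When every row has its best entry on the diagonal, the identity permutation is optimal; otherwise a full solver (for example `scipy.optimize.linear_sum_assignment`) would be needed to find the matching.

```python
import numpy as np


def weight_distance_matrix(w_a, w_b):
    """M[i, j] = l2 distance between column i of w_a and column j of w_b.

    Both inputs have shape (m, n); the result has shape (n, n)."""
    diff = w_a[:, :, None] - w_b[:, None, :]      # shape (m, n, n)
    return np.sqrt((diff ** 2).sum(axis=0))


def activation_similarity_matrix(z_a, z_b):
    """C = Z_A^T Z_B for activations of shape (N, n); the result is (n, n)."""
    return z_a.T @ z_b


def diagonal_is_best(matrix, larger_is_better):
    """Return True if the best entry of every row lies on the diagonal.

    Use larger_is_better=False for the distance matrix M and
    larger_is_better=True for the similarity matrix C."""
    if larger_is_better:
        best = matrix.argmax(axis=1)
    else:
        best = matrix.argmin(axis=1)
    return bool(np.array_equal(best, np.arange(matrix.shape[0])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_a = rng.normal(size=(8, 5))
    w_b = w_a + 0.01 * rng.normal(size=(8, 5))    # toy "fine-tuned" copy
    m = weight_distance_matrix(w_a, w_b)
    print(diagonal_is_best(m, larger_is_better=False))
```

The row-wise check is a sufficient condition only: a matrix can fail it and still have the identity as its optimal assignment, so it serves as a quick diagnostic rather than a replacement for the matching algorithms.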



The code is available at: https://github.com/bloomberg/dataless-model-merging

EXPERIMENTAL SETUP

4.1 EVALUATION SETTINGS

We expect two major benefits of merging models for the developer. First, by combining knowledge of individual models $f_{1..N}$ (or a subset K of them, $f_K$) trained on $D_{1..N}$, we expect the resulting



Figure 1: Diagram containing the problem formation for model merging and its comparison to other setups including multi-task learning, model ensembling and federated learning. Models f 1..N trained by individuals or organizations are released to the user (optionally with some statistics) but the training data D 1..N is kept private.

Figure 2: Comparison between Simple, Fisher, and RegMean for merging transformer-based language models.

Figure 3: Relative performance drop (%) of pairwise merged models compared to the domain-specific models.

Figure 4: Relative performance drop (%) of merged models compared to task-specific models in our pairwise model merging experiments over GLUE.

Figure 5: Performance of RegMean with different values of α in Emotion Classification. * denotes Simple Average.

Figure 6: Examples of improved out-of-domain generalization performance when incrementally merging a subset of individual models in the order of their OOD performance compared to merging all models. The main comparison is against the best individual model $f_{1..N}$ (shown as the dashed line).

without these statistics, a few training examples (see Appendix C for the sensitivity to the number of training examples) are needed to compute them. Besides, there is a risk that inner product matrices may reveal information about training data. Quantitatively measuring information leakage in these statistics should be a good direction of research in the area of privacy.

Merging models trained on non-i.i.d. partitions of GLUE tasks. We compare the performance of the merged models (Simple, Fisher, RegMean) and the average performance of each pair of individual models (Avg. $f_{1..N}$) over the joint validation sets.

In-domain performance when merging all 5 emotion classification models or 6 NER models. Simple, Fisher and RegMean are the model merging algorithms for comparison. Bold numbers indicate the best performance across different model merging algorithms.

Out-of-domain performance when merging all 5 emotion classification models or 6 NER models.

datasets the GLUE task collections. We run evaluation on the official development sets because test labels are hidden. We compute Matthews Correlation for CoLA, Pearson Correlation for STS-B, and accuracy for all other tasks.

Statistics of emotion classification datasets.

Table 4 summarizes statistics of the datasets.

Statistics of NER datasets.

OOD performance when merging two RoBERTa-base emotion classification models (with same head initialization) with RegMean. Diagonal items represent OOD performance of individual models. We show OOD performance is dependent on the models used for merging.

Enumerating

Comparison of performing regularization by adding a constant to diagonals or relative scaling of non-diagonals of inner product matrices. We merge T5-base Emotion Classification models and evaluate average in-domain F1.

ACKNOWLEDGMENTS

Xisen Jin is supported by a Bloomberg Data Science Ph.D. Fellowship.


Smaller values are highlighted in the heatmaps. We fine-tune RoBERTa-base models on two different emotion classification datasets. The resulting matrix T is used as the ground metric for computing optimal transport in the weight-based matching of Singh & Jaggi (2020).

Choice of models to merge and its effect on OOD performance. Table 7 summarizes OOD performance when merging each pair of RoBERTa-base emotion classification models (with the same head initialization) with RegMean. We see that the OOD performance clearly depends on the models chosen for merging. Merging the TEC and ISEAR models, which are the two individual models with the best OOD performance, produces the merged model with the best OOD performance.

D PERMUTATION MATCHING ALGORITHMS FOR MERGING LANGUAGE MODELS

Several existing works (Singh & Jaggi, 2020; Ainsworth et al., 2022) propose algorithms to match weight permutations in two models before merging, as models with similar outputs may involve distinct permutations in their weights. However, the experiments in these works do not cover transformer LMs. In this section, we present an analysis to address two research questions about permutation matching algorithms in the setup of merging language models fine-tuned from shared pretrained weights: (1) does the issue of weight permutation exist in this setup? (2) do existing permutation matching algorithms improve the performance of model merging?

We experiment with merging two RoBERTa-base models fine-tuned on emotion classification datasets. We visualize results on merging models trained on Tales-Emotion and ISEAR in Figures 7 and 8.

Weight-Based Matching. We apply weight-based matching from OTFusion (Singh & Jaggi, 2020). To find permutations between weight matrices $W_A$ and $W_B$ in the same layer of two different models, the algorithm computes a ground metric matrix $M \in \mathbb{R}^{n \times n}$, where n is the dimension of the output. Each element $M_{ij}$ of $M$ measures the ℓ2 distance between a pair of weight vectors $W_A^{:,i}$ and W

