DATALESS KNOWLEDGE FUSION BY MERGING WEIGHTS OF LANGUAGE MODELS

Abstract

Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes, fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data set domains and generalizes to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning, as it can preserve or sometimes improve over the performance of the individual models without access to their training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.

1. INTRODUCTION

The dominant paradigm for solving NLP tasks ranging from classification to sequence tagging involves fine-tuning a pre-trained language model (PLM) using task-specific labeled data (Devlin et al., 2019; He et al., 2021). This results in specialized models that are explicitly trained to run inference over a single domain and task. Multi-task learning has shown that leveraging information across domains or tasks can be beneficial if the data sets, data set sizes, and algorithms are well selected (Phang et al., 2018; Pruksachatkun et al., 2020; Poth et al., 2021; Weller et al., 2022). Combining the knowledge of multiple data sets in a single model can lead to better overall performance on in-domain data (Poth et al., 2021), can generalize better on out-of-domain data (Wang et al., 2020b), and yields a model that is more practical and parameter-efficient than maintaining specialized models.

However, the multi-task learning setup suffers from two practical limitations. First, the training process requires access to the original labeled data, which may not be realistic: annotated data may be kept private by the agent fine-tuning the model, whether to ensure data or annotation privacy or to guard intellectual property over the annotations. Second, because a significant number of data or task combinations are not beneficial to performance (Poth et al., 2021), building a single model requires training on all data set combinations to identify the optimal one, which can be prohibitive, especially when many source data sets or models are available.

Model merging is defined as combining multiple models into a single one in parameter space without access to data (Matena & Raffel, 2021). This technique provides an alternative to building a single model while satisfying data privacy constraints.
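To make the distinction between parameter-space merging and output-space combination concrete, here is a minimal sketch (with hypothetical names, treating each parameter as a flat list of floats) contrasting simple weight merging with model ensembling:

```python
def merge_parameters(weight_list):
    """Parameter-space merging: average corresponding parameters
    elementwise across models (the simplest possible merging rule)."""
    merged = {}
    for name in weight_list[0]:
        tensors = [w[name] for w in weight_list]
        merged[name] = [sum(vals) / len(vals) for vals in zip(*tensors)]
    return merged

def ensemble_outputs(models, x):
    """Output-space ensembling: run every model and average predictions.
    Requires keeping (and running) all individual models at inference."""
    outputs = [m(x) for m in models]
    return [sum(vals) / len(vals) for vals in zip(*outputs)]
```

Merging produces a single model with the footprint of one, while ensembling multiplies inference cost by the number of models.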
Weight merging algorithms usually also have a closed-form solution, making them very efficient: no retraining is necessary, which enables their use even when a large number of data sets or model combinations is available. Merging can be considered an alternative to model ensembling (Opitz & Maclin, 1999; Rokach, 2010), where the outputs of individual models are combined to produce the final prediction. Model merging algorithms are also a key step in federated learning (McMahan et al., 2017; Lin et al., 2022), where multiple agents train their own models using private data and share only model updates with other agents. However, in federated learning, model merging happens in multiple rounds of updates, after which the merged model is broadcast to all agents before the next round of training with private data. Dataless model merging is thus an extreme case of federated learning in which only a single round of synchronization is admissible. Figure 1 provides an overview of the various related setups.

We thus aim to use model merging to build a single model that can be used for inference on multiple domains or tasks and can generalize to new domains, in line with Wang et al. (2020b). In contrast, existing works such as Wortsman et al. (2022) used simple averaging of weights to improve the performance of a specific model, where averaging was done over models fine-tuned on the same data set with different hyperparameters. Separately, Matena & Raffel (2021) focus on improving performance on a single target task by leveraging models trained on other donor tasks, merging models via Fisher-weighted averaging.

This paper focuses on merging fine-tuned models that originate from pre-trained language models with the same architecture and pre-trained weights. We introduce a novel model merging method named Regression Mean (RegMean), which is computationally efficient and extendable to merging any number of models. The method is inspired by the optimal solution for linear models that minimizes the ℓ2 distance between the merged and individual models and has a closed-form solution. We evaluate model merging algorithms in setups that range in complexity and type of fused knowledge. Experimental results across multiple model types (e.g., RoBERTa, T5, DeBERTa) show that our proposed method consistently and significantly outperforms other model merging and ensembling baselines, and achieves higher generalization performance than the best individual models on out-of-domain data sets across several data collections. Our contributions are three-fold: (1) a novel model merging algorithm (Regression Mean); (2) an evaluation protocol for model merging algorithms that tests both in-domain and out-of-domain generalization ability; (3) an analysis of computation and parameter efficiency across setups.
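For intuition, the closed-form least-squares idea can be sketched for a single linear layer. This is an illustrative reading, not the paper's definitive recipe: assume each model i releases the inner-product (Gram) matrix G_i = X_iᵀX_i of its layer inputs as a statistic, which does not expose the raw examples or labels.

```python
import numpy as np

def regmean_linear(weights, grams):
    """Merge linear-layer weights W_i (d_in x d_out) by minimizing the
    total squared difference between merged and individual layer outputs:
        W_M = argmin_W  sum_i ||X_i W - X_i W_i||^2
            = (sum_i G_i)^{-1} sum_i (G_i W_i),   with G_i = X_i^T X_i.
    """
    g_sum = np.sum(grams, axis=0)
    gw_sum = np.sum([g @ w for g, w in zip(grams, weights)], axis=0)
    # Solve the linear system rather than forming an explicit inverse.
    return np.linalg.solve(g_sum, gw_sum)
```

Note that when all Gram matrices are identical, this reduces to simple weight averaging; practical variants may additionally regularize the Gram matrices, which this sketch omits.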

2. DATALESS MODEL MERGING FOR KNOWLEDGE FUSION

We consider a problem formulation with two main roles: (1) the agents (e.g., individuals or organizations) that train and release models; (2) the developers who aim to build a single model by fusing the knowledge of multiple available models. Each agent i ∈ {1..N} fine-tunes a language model (LM) f_i from pre-trained weights θ_LM over their private labeled data set D_i = ⟨X_i, Y_i⟩ to obtain fine-tuned model weights θ_i, where X_i ∈ R^{N_i × *} are inputs, Y_i ∈ R^{N_i × *} are labels, and N_i is the number of annotated examples. The agents keep the labeled data set D_i private. In addition to the fine-tuned model weights f_i(·; θ_i), the agents can also optionally disseminate certain statistics S_i, as long as these do not leak information about the labeled data set D_i. In turn, the developers use the fine-tuned models f_i(·; θ_i) and statistics S_i as inputs to a merging function g. The merging function is applied to a subset of fine-tuned models K ⊆ {1..N} (of size |K|) to obtain parameters θ_{M_K} of a merged model f_{M_K}, where θ_{M_K} = g(θ_K, S_K). In general, we expect the function g to be computationally efficient and to produce θ_{M_K} with a closed-form formulation.
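For concreteness, a minimal sketch of what a merging function g(θ_K, S_K) might look like on the developer side. The names are hypothetical, and a per-parameter weighted average stands in for a real merging rule; here each statistic S_i is a single scalar importance weight, one example of information an agent could release without leaking D_i.

```python
def merge(thetas, stats):
    """A merging function g(theta_K, S_K) -> theta_{M_K} with a closed form.

    thetas: list of dicts mapping parameter names to flat lists of floats
            (one dict per fine-tuned model in the subset K).
    stats:  one scalar importance weight per model (hypothetical S_i).
    """
    total = sum(stats)
    merged = {}
    for name in thetas[0]:
        merged[name] = [
            sum(s * theta[name][j] for s, theta in zip(stats, thetas)) / total
            for j in range(len(thetas[0][name]))
        ]
    return merged
```

With uniform statistics this degenerates to simple averaging; the requirement is only that g be cheap to evaluate and need nothing beyond θ_K and S_K.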



The code is available at: https://github.com/bloomberg/dataless-model-merging



Figure 1: Diagram containing the problem formulation for model merging and its comparison to other setups, including multi-task learning, model ensembling, and federated learning. Models f_{1..N} trained by individuals or organizations are released to the user (optionally with some statistics), but the training data D_{1..N} is kept private.

