DATALESS KNOWLEDGE FUSION BY MERGING WEIGHTS OF LANGUAGE MODELS

Abstract

Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data set domains and can generalize on out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning, in that it can preserve or sometimes improve over the performance of the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.

1. INTRODUCTION

The dominant paradigm for solving NLP tasks ranging from classification to sequence tagging involves fine-tuning a pre-trained language model (PLM) using task-specific labeled data (Devlin et al., 2019; He et al., 2021). This results in specialized models that are explicitly trained to run inference over a single domain and task. Multi-task learning has shown that leveraging information across domains or tasks can be beneficial if the data sets, data set sizes, and algorithms are well selected (Phang et al., 2018; Pruksachatkun et al., 2020; Poth et al., 2021; Weller et al., 2022). Combining knowledge from multiple data sets in a single model can lead to better overall performance on in-domain data (Poth et al., 2021), better generalization on out-of-domain data (Wang et al., 2020b), and a model that is more practical and parameter efficient than maintaining specialized models.

However, the multi-task learning setup suffers from two practical limitations. First, the training process requires access to the original labeled data, which may not be realistic: annotated data may be kept private by the agent that fine-tuned the model, whether to ensure data or annotation privacy or to guard intellectual property in the annotations. Second, because many data set or task combinations are not beneficial to performance (Poth et al., 2021), building a single model requires training on all data set combinations to identify the optimal one, which can be prohibitive, especially when many source data sets or models are available.

Model merging is defined as combining multiple models into a single one in parameter space without access to data (Matena & Raffel, 2021). This technique provides an alternative to building a single model while satisfying data privacy constraints. Weight merging algorithms usually have a closed-form solution, making them very efficient as no retraining is necessary, thus enabling their use even when a large number of data set or model combinations is available; the simplest instances, plain and Fisher-weighted parameter averaging, are sketched below. Merging can be considered as an alternative to model ensembling (Opitz & Maclin, 1999; Rokach, 2010), where the



The code is available at: https://github.com/bloomberg/dataless-model-merging
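As context for the merging schemes discussed above: the two baselines named in the abstract, simple parameter averaging and Fisher-weighted averaging, both operate purely in parameter space with a closed form. The following is a minimal sketch of both, assuming PyTorch models fine-tuned from the same pre-trained checkpoint (so their state dicts share keys and shapes); the function name `merge_state_dicts` and its arguments are illustrative and not taken from the released repository.

```python
# A minimal sketch of parameter-space model merging, assuming all models
# share an architecture (e.g., they were fine-tuned from the same
# pre-trained checkpoint). Names are illustrative, not from the released
# repository.
from typing import Dict, List, Optional

import torch


def merge_state_dicts(
    state_dicts: List[Dict[str, torch.Tensor]],
    importances: Optional[List[Dict[str, torch.Tensor]]] = None,
) -> Dict[str, torch.Tensor]:
    """Merge models in parameter space.

    With importances=None this is uniform averaging; passing per-parameter
    importance tensors (e.g., diagonal Fisher estimates) yields a
    Fisher-weighted average instead.
    """
    merged = {}
    for name in state_dicts[0]:
        # Stack the same parameter across all models: shape (n_models, ...).
        params = torch.stack([sd[name].float() for sd in state_dicts])
        if importances is None:
            merged[name] = params.mean(dim=0)  # uniform average
        else:
            w = torch.stack([imp[name].float() for imp in importances])
            # Importance-weighted average, guarded against all-zero weights.
            merged[name] = (w * params).sum(dim=0) / w.sum(dim=0).clamp_min(1e-12)
    return merged


# Usage: load the merged parameters into a model of the same architecture.
# merged = merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
# model.load_state_dict(merged)
```

Note that this is a closed-form, no-retraining operation, which is what makes merging cheap relative to multi-task training; the method proposed in this paper replaces the uniform or Fisher weighting with weights chosen to minimize prediction differences between the merged model and the individual models.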

