EDITING MODELS WITH TASK ARITHMETIC

Abstract

Changing how pre-trained models behave (e.g., improving their performance on a downstream task or mitigating biases learned during pre-training) is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around task vectors. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form "A is to B as C is to D", combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.

1. INTRODUCTION

Pre-trained models are commonly used as backbones of machine learning systems. In practice, we often want to edit models after pre-training, to improve performance on downstream tasks [105; 100; 63; 39], mitigate biases or unwanted behavior [85; 59; 82; 71], align models with human preferences [4; 74; 44; 32], or update models with new information [104; 15; 69; 70]. In this work, we present a new paradigm for editing neural networks based on task vectors, which encode the information necessary to do well on a given task. Inspired by recent work on weight interpolation [27; 100; 63; 99; 39; 55; 2; 20], we obtain such vectors by taking the weights of a model fine-tuned on a task and subtracting the corresponding pre-trained weights (Figure 1a). We show that we can edit a variety of models with task arithmetic: performing simple arithmetic operations on task vectors (Figure 1b-d). For example, negating a vector can be used to remove undesirable behaviors or unlearn tasks, while adding task vectors leads to better multi-task models, or even improves performance on a single task. Finally, when tasks form an analogy relationship, task vectors can be combined to improve performance on tasks where data is scarce.

Forgetting via negation. Users can negate task vectors to mitigate undesirable behaviors (e.g., toxic generations), or even to forget specific tasks altogether, like OCR. In Section 3, we negate a task vector from a language model fine-tuned on toxic data [77; 8], reducing the proportion of generations classified as toxic, with little change in fluency. We also negate task vectors for image classification tasks, resulting in substantially lower accuracy on the task we wish to forget, with little loss in ImageNet accuracy [16].
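The weight-space operations described above (building a task vector by subtraction, then negating or adding vectors before applying them back to the pre-trained weights) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes models are represented as plain dicts mapping parameter names to lists of floats, and the helper names (`task_vector`, `apply_vector`, and the scaling coefficient `scale`) are ours for exposition.

```python
# Minimal sketch of task arithmetic, assuming a "model" is a dict of
# parameter names -> lists of floats (a stand-in for a framework state dict).

def task_vector(pretrained, finetuned):
    # Task vector = fine-tuned weights minus pre-trained weights.
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])]
            for k in pretrained}

def apply_vector(pretrained, vector, scale=1.0):
    # Edit a model by moving its weights along a (scaled) task vector.
    return {k: [p + scale * v for p, v in zip(pretrained[k], vector[k])]
            for k in pretrained}

def negate(vector):
    # Negation: used for forgetting / mitigating a behavior.
    return {k: [-v for v in vector[k]] for k in vector}

def add(v1, v2):
    # Addition: used for building multi-task models.
    return {k: [a + b for a, b in zip(v1[k], v2[k])] for k in v1}

pretrained = {"w": [1.0, 2.0]}
finetuned = {"w": [2.0, 4.0]}
tv = task_vector(pretrained, finetuned)        # {"w": [1.0, 2.0]}
edited = apply_vector(pretrained, negate(tv))  # moves away from the task
```

Applying `tv` at `scale=1.0` recovers the fine-tuned weights exactly; negation or addition of several task vectors produces new models without any additional training.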

