LINEAR CONNECTIVITY REVEALS GENERALIZATION STRATEGIES

Abstract

In the mode connectivity literature, it is widely accepted that there are common circumstances in which two neural networks, trained similarly on the same data, will maintain loss when interpolated in the weight space. In particular, transfer learning is presumed to ensure the necessary conditions for linear mode connectivity across training runs. In contrast to existing results from image classification, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster: models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies. For example, on MNLI, one cluster behaves like a bag-of-words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions in standard finetuning settings.

1. INTRODUCTION

Recent work on the geometry of loss landscapes has repeatedly demonstrated a tendency for fully trained models to fall into a single linearly-connected basin of the loss surface across different training runs (Entezari et al., 2021; Frankle et al., 2020; Neyshabur et al., 2020). This observation has been presented as a fundamental inductive bias of SGD (Ainsworth et al., 2022), and linear mode connectivity (LMC) has been linked to in-domain generalization behavior (Frankle et al., 2020; Neyshabur et al., 2020). However, these results have relied exclusively on a single task: image classification. In fact, methods relying on assumptions of LMC can fail when applied outside of image classification tasks (Wortsman et al., 2022), but other settings such as NLP nonetheless remain neglected in the mode connectivity literature.

In this work, we study LMC in several text classification tasks, repeatedly finding counterexamples where multiple basins are accessible during training by varying data order and classifier head initialization. Furthermore, we link a model's basin membership to a real consequence: behavior under distribution shift.

In NLP, generalization behavior is often described by precise rules and heuristics, as when a language model observes a plural subject noun and thus prefers the following verb to be pluralized (the dogs play rather than plays). We can measure a model's adherence to a particular rule through the use of diagnostic or challenge sets. Previous studies of model behavior on out-of-distribution (OOD) linguistic structures show that identically trained finetuned models can exhibit variation in their generalization to diagnostic sets (McCoy et al., 2020; Zhou et al., 2020). For example, many models perform well on in-domain (ID) data, but diagnostic sets reveal that some of them deploy generalization strategies that fail to incorporate position information (McCoy et al., 2019) or are otherwise brittle.
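The barrier measurement behind LMC can be sketched in a few lines. The following is a minimal, self-contained illustration, not the finetuned-transformer setup studied here: two tiny ReLU networks that compute the same function but with permuted hidden units, so that linearly interpolating their weights produces a loss barrier (all function and variable names are our own for illustration).

```python
import numpy as np

def forward(theta, x):
    # Tiny two-hidden-unit ReLU net: f(x) = v1*relu(w1*x) + v2*relu(w2*x)
    w1, w2, v1, v2 = theta
    return v1 * np.maximum(w1 * x, 0.0) + v2 * np.maximum(w2 * x, 0.0)

def mse(theta, x, y):
    return np.mean((forward(theta, x) - y) ** 2)

def barrier(theta_a, theta_b, x, y, steps=21):
    """Loss-barrier height on the linear path between two parameter vectors.

    Following the usual LMC definition, the barrier is the largest amount by
    which the loss along the path exceeds the straight line between the two
    endpoint losses; a value near zero means the pair is linearly connected.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    path = np.array([mse((1 - a) * theta_a + a * theta_b, x, y) for a in alphas])
    chord = (1 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - chord))

x = np.linspace(-1.0, 1.0, 101)
y = np.abs(x)  # target function |x| = relu(x) + relu(-x)

# Two exact solutions that differ only by permuting the hidden units;
# each achieves zero loss on its own.
theta_a = np.array([1.0, -1.0, 1.0, 1.0])
theta_b = np.array([-1.0, 1.0, 1.0, 1.0])

print(round(barrier(theta_a, theta_b, x, y), 2))  # prints 0.34: a genuine barrier
```

At the midpoint of the path both hidden weights cancel to zero, so the interpolated network outputs a constant and incurs high loss even though both endpoints fit the data exactly; this permutation symmetry is the standard intuition for why separate basins can exist at all.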
These different generalization behaviors have never been linked to the geometry of the loss surface. In order to explore how barriers in the loss surface expose a model's generalization strategy, we will consider a variety of text classification tasks. We focus on Natural Language Inference (NLI; Williams et al., 2018; Consortium et al., 1996), as well as paraphrase and grammatical acceptability

