LINEAR CONNECTIVITY REVEALS GENERALIZATION STRATEGIES

Abstract

In the mode connectivity literature, it is widely accepted that there are common circumstances in which two neural networks, trained similarly on the same data, will maintain loss when interpolated in the weight space. In particular, transfer learning is presumed to ensure the necessary conditions for linear mode connectivity across training runs. In contrast to existing results from image classification, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster: models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies. For example, on MNLI, one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions in standard finetuning settings.

1. INTRODUCTION

Recent work on the geometry of loss landscapes has repeatedly demonstrated a tendency for fully trained models to fall into a single linearly-connected basin of the loss surface across different training runs (Entezari et al., 2021; Frankle et al., 2020; Neyshabur et al., 2020). This observation has been presented as a fundamental inductive bias of SGD (Ainsworth et al., 2022), and linear mode connectivity (LMC) has been linked to in-domain generalization behavior (Frankle et al., 2020; Neyshabur et al., 2020). However, these results have relied exclusively on a single task: image classification. In fact, methods relying on assumptions of LMC can fail when applied outside of image classification tasks (Wortsman et al., 2022), but other settings such as NLP nonetheless remain neglected in the mode connectivity literature. In this work, we study LMC in several text classification tasks, repeatedly finding counterexamples where multiple basins are accessible during training by varying data order and classifier head initialization.

Furthermore, we link a model's basin membership to a real consequence: behavior under distribution shift. In NLP, generalization behavior is often described by precise rules and heuristics, as when a language model observes a plural subject noun and thus prefers the following verb to be pluralized (the dogs play rather than plays). We can measure a model's adherence to a particular rule through the use of diagnostic or challenge sets. Previous studies of model behavior on out-of-distribution (OOD) linguistic structures show that identically trained finetuned models can exhibit variation in their generalization to diagnostic sets (McCoy et al., 2020; Zhou et al., 2020). For example, many models perform well on in-domain (ID) data, but diagnostic sets reveal that some of them deploy generalization strategies that fail to incorporate position information (McCoy et al., 2019) or are otherwise brittle.
These different generalization behaviors have never been linked to the geometry of the loss surface. To explore how barriers in the loss surface expose a model's generalization strategy, we consider a variety of text classification tasks. We focus on Natural Language Inference (NLI; Williams et al., 2018; Consortium et al., 1996), as well as paraphrase and grammatical acceptability tasks. Using standard finetuning methods, we find that in all three tasks, models that perform similarly on the same diagnostic sets are linearly connected without barriers on the ID loss surface, but tend to be disconnected from models with different generalization behavior. Our code and models are public.[1]

Our main contributions are:
• In contrast with existing work in computer vision (Neyshabur et al., 2020), we find that transfer learning can lead to different basins over different finetuning runs (Section 3). We develop a metric for model similarity based on LMC, the convexity gap (Section 4), and an accompanying method for clustering models into basins (Section 4.1).
• We align the basins to specific generalization behaviors (Section 4). In NLI (Section 2.1), they correspond to a preference for either syntactic or lexical overlap heuristics. On a paraphrase task (Section 2.2), they split on behavior under word order permutation. On a linguistic acceptability task, they reveal the ability to classify unseen linguistic phenomena (Appendix A).
• We find that basins trap a portion of finetuning runs, which become increasingly disconnected from the other models as they train (Section 4.2). Connections between models in the early stages of training may thus predict final heuristics.
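The core measurement throughout, interpolating in weight space and checking for a loss barrier, can be made concrete with a small sketch. The following is an illustrative NumPy implementation, not the paper's code: the flattened weight vectors, the toy loss functions, and the 25-point grid are all assumptions for the sake of the example.

```python
import numpy as np

def interpolation_losses(theta_a, theta_b, loss_fn, n_points=25):
    """Evaluate loss_fn along the linear path between two flat weight vectors.

    theta_a, theta_b: np.ndarray weight vectors of the same shape.
    loss_fn: maps a weight vector to a scalar loss.
    Returns the interpolation coefficients and the losses along the path.
    """
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def loss_barrier(theta_a, theta_b, loss_fn, n_points=25):
    """Barrier height: max loss on the path minus the larger endpoint loss.

    A value near zero indicates the two models are linearly mode connected;
    a large positive value indicates they sit in separate basins.
    """
    _, losses = interpolation_losses(theta_a, theta_b, loss_fn, n_points)
    return float(losses.max() - max(losses[0], losses[-1]))
```

On a convex surface the barrier is zero by construction; a barrier only appears when the path crosses a region of higher loss between two basins, which is exactly the signature the paper reports between clusters of finetuned text classifiers.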

2. IDENTIFYING GENERALIZATION STRATEGIES

Finetuning on standard GLUE (Wang et al., 2018) datasets often leads to models that perform similarly on in-domain (ID) test sets (Sellam et al., 2021). In this paper, to evaluate the functional differences between these models, we measure generalization to OOD domains. We therefore study the variation of performance on existing diagnostic datasets. We call models with poor performance on the diagnostic set heuristic models, while those with high performance are generalizing models. We study three tasks with diagnostic sets: NLI, paraphrase, and grammaticality (the latter in Appendix A). All models are initialized from bert-base-uncased[2] with a linear classification head and trained with Google's original trainer.[3] The only difference between models trained on a particular dataset is the random seed, which determines both the initialization of the linear classification head and the data order; we do not deliberately introduce preferences for different generalization strategies.
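The per-run setup can be sketched as follows: a single seed fixes both sources of randomness. This is purely illustrative; the 0.02 initialization scale, the label count, and the toy dataset size are assumptions, not details from the paper (768 is BERT-base's hidden size).

```python
import random
import numpy as np

def run_specific_state(seed, num_labels=2, hidden_size=768, num_examples=10):
    """Sketch of the only per-run variation described above: one seed
    determines both the random init of the linear classification head
    and the order in which training examples are presented."""
    rng = np.random.RandomState(seed)
    # linear head over the encoder's final hidden state (illustrative 0.02 scale)
    head_weights = rng.normal(0.0, 0.02, size=(hidden_size, num_labels))
    # stand-in for dataset indices; the seed also fixes their shuffle
    data_order = list(range(num_examples))
    random.Random(seed).shuffle(data_order)
    return head_weights, data_order
```

The point of this design choice is that any divergence in final basin membership must come from this seed alone, since the pretrained encoder, the data, and the hyperparameters are shared across runs.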

2.1. NATURAL LANGUAGE INFERENCE

NLI is a common testbed for NLP models. This binary classification task poses a challenge in modeling both syntax and semantics. The input to an NLI model is a pair of sentences such as: {Premise: The dog scared the cat. Hypothesis: The cat was scared by the dog.} Here, the label is positive or entailment, because the hypothesis can be inferred from the premise. If the hypothesis were "The dog was scared by the cat", the label would be negative or non-entailment. We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID "matched" validation set.

NLI models often "cheat" by relying on heuristics, such as overlap between individual lexical items or between syntactic constituents shared by the premise and hypothesis. A model relying on lexical overlap might assign positive labels to both the entailed and non-entailed examples above, because all three sentences contain "scared", "dog", and "cat". McCoy et al. (2019) responded to these shortcuts by creating HANS, a diagnostic set of sentence pairs that violate three such heuristics:
• Lexical overlap (HANS-LO): assume the premise entails any hypothesis containing the same words as the premise.
• Subsequence: assume the premise entails any hypothesis containing contiguous sequences of words from the premise.
• Constituent: assume the premise entails any hypothesis containing syntactic subtrees from the premise.
Unless otherwise specified, we use the non-entailing HANS subsets for measuring reliance on heuristics, so higher accuracy on HANS-LO indicates less reliance on the lexical overlap heuristic.
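The lexical overlap heuristic can be written down as a tiny rule-based predictor: call a pair entailed whenever every hypothesis word also appears in the premise, ignoring word order entirely. This is a hedged sketch of the heuristic itself, not HANS or any model from the paper; the crude tokenization and the doctor/lawyer example pair are illustrative.

```python
def lexical_overlap_predict(premise, hypothesis):
    """Predict entailment iff every hypothesis word also appears in the premise.

    This is the word-order-blind shortcut that HANS-LO is built to expose:
    on the non-entailing HANS-LO subset, every hypothesis word appears in
    the premise yet the gold label is non-entailment, so this predictor
    is wrong on every example there.
    """
    def words(sentence):
        # crude tokenization: lowercase and strip trailing punctuation
        return {w.strip(".,!?").lower() for w in sentence.split()}
    return words(hypothesis) <= words(premise)
```

A model whose behavior matches this predictor under domain shift is exactly what the paper calls a heuristic model; the diagnostic set separates it from a generalizing model that the ID test set alone cannot distinguish.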



[1] Models: https://huggingface.co/connectivity
[2] https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
[3] https://github.com/google-research/bert

QQP models are trained with Google's recommended default hyperparameters (details in Appendix F). MNLI models are those provided by McCoy et al. (2020).

