

Abstract

Neural networks are known to be biased towards learning mechanisms that help identify spurious attributes, yielding features that do not generalize well under distribution shifts. To understand and address this limitation, we study the geometry of neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks are connected via simple paths of low loss. Our work addresses two questions: (i) do minimizers that encode dissimilar mechanisms connect via simple paths of low loss? (ii) can fine-tuning a pretrained model help switch between such minimizers? We define a notion of mechanistic similarity and demonstrate that lack of linear connectivity between two minimizers implies the corresponding models use dissimilar mechanisms for making their predictions. This property helps us demonstrate that naïve fine-tuning can fail to eliminate a model's reliance on spurious attributes. We thus propose a method for altering a model's mechanisms, named connectivity-based fine-tuning, and validate its usefulness by inducing models invariant to spurious attributes.

1. INTRODUCTION

Deep neural networks (DNNs) suffer from various robustness problems, learning representations that fail to generalize well beyond the given training distribution (D'Amour et al., 2020; Teney et al., 2022; Geirhos et al., 2020; Recht et al., 2019; Taori et al., 2020; Jacobsen et al., 2018). This lack of robustness is generally a consequence of models learning mechanisms that rely on spurious attributes in the training data for making their predictions. Such attributes, even if not perfectly predictive, tend to be simpler to represent according to the model's inductive biases (Nakkiran et al., 2019; Valle-Perez et al., 2018; Hu et al., 2020; Shah et al., 2020) and commonly emerge due to sampling biases and hidden confounders in static datasets (Kaur et al., 2022; Lee et al., 2022). For example, in most vision datasets, backgrounds are correlated with object categories, a sampling bias (Beery et al., 2018; Xiao et al., 2020). Consequently, a model can learn to predict the correct category of an object by learning mechanisms to identify either its background or its shape; however, only models that rely on shape are likely to generalize robustly (Geirhos et al., 2018; Dittadi et al., 2020). Indeed, Scimeca et al. (2021) and Hermann & Lampinen (2020) show that, depending on the dataset used for a task, standard training pipelines can induce models that use entirely distinct mechanisms for making their predictions, performing equally well in-distribution but vastly differently out-of-distribution. Recent works on improving neural network robustness thus advocate a need for modeling the causal mechanisms underlying the data-generating process (Arjovsky et al., 2019; Krueger et al., 2021; Lu et al., 2021), promoting representations invariant to spurious attributes.

This Work: In this paper, we introduce the idea of mechanistic similarity (Sec. 3) to assess whether two models rely on the same input attributes for making their predictions.
Specifically, we call two models mechanistically similar if they exhibit invariance to the same attributes of an input, while possibly producing otherwise different representations of it. Our motivating question is whether a model can be fine-tuned to alter its mechanisms, i.e., to learn different invariances; we call this the problem of mechanistic fine-tuning (Sec. 5). For instance, if a model has learned to rely on a spurious attribute in its training data, can that reliance be eliminated by training it on a minimal set of "clean" samples that do not contain the spurious attribute? This problem is of practical value because curating a large, clean dataset and training on it from scratch can be expensive. We consider the problem above through the lens of mode connectivity in neural networks, which refers to the phenomenon that neural network minimizers identified via training on the same dataset for a task tend to be connected via relatively simple paths of low loss in the model's loss landscape (e.g.,
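The linear connectivity of two minimizers can be probed empirically by evaluating the loss along the straight-line interpolation of their weights and measuring the resulting barrier. The sketch below is illustrative only, not the paper's method: it uses a toy logistic-regression model in NumPy, and the weight vectors `w_a` and `w_b` stand in for two trained minimizers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # Binary cross-entropy of a logistic model with weight vector w.
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def linear_path_barrier(w_a, w_b, X, y, n_points=21):
    # Evaluate the loss at evenly spaced points on the segment between w_a and w_b.
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = [loss((1 - a) * w_a + a * w_b, X, y) for a in alphas]
    # Barrier: peak loss along the path minus the mean loss at the two endpoints.
    return max(losses) - 0.5 * (losses[0] + losses[-1])

# Toy data: two well-separated clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w_a = np.array([1.0, 1.0])   # stand-ins for two trained minimizers
w_b = np.array([1.5, 0.5])
barrier = linear_path_barrier(w_a, w_b, X, y)
print(f"loss barrier on linear path: {barrier:.4f}")
```

A large barrier indicates that the two minimizers are not linearly connected; under the notion developed in this paper, that is evidence the corresponding models rely on dissimilar mechanisms.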

