DON'T FORGET THE NULLSPACE! NULLSPACE OCCUPANCY AS A MECHANISM FOR OUT OF DISTRIBUTION FAILURE

Abstract

Out of distribution (OoD) generalization has received considerable interest in recent years. In this work, we identify a particular failure mode of OoD generalization for discriminative classifiers that is based on test data (from a new domain) lying in the nullspace of features learnt from source data. We demonstrate the existence of this failure mode across multiple networks trained across RotatedMNIST, PACS, TerraIncognita, DomainNet and ImageNet-R datasets. We then study different choices for characterizing the feature space and show that projecting intermediate representations onto the span of directions that obtain maximum training accuracy provides consistent improvements in OoD performance. Finally, we show that such nullspace behavior also provides an insight into neural networks trained on poisoned data. We hope our work galvanizes interest in the relationship between the nullspace occupancy failure mode and generalization.

1. INTRODUCTION

Neural networks often succeed in learning rich function approximators that generalize remarkably well to the distribution they are trained on, but are often brittle when exposed to inputs that come from a different distribution (Gulrajani & Lopez-Paz, 2020). With the rapid adoption of neural networks in safety-critical applications such as autonomous driving and healthcare, more attention is being paid to the question of robustness under domain shift (Alcorn et al., 2018; Dai & Van Gool, 2018; AlBadawy et al., 2018). Huh et al. (2021) hint that overparameterized, deep neural networks are biased to learn functions with (approximately) low-rank covariance structure, and posit that this might be related to the phenomenon of implicit regularization (Galanti & Poggio, 2022) that has been used to explain in-distribution generalization of deep networks.


How might such low-rank structure relate to out-of-distribution generalization? As a simple thought experiment, consider a setting where the training data D_train is embedded in a three-dimensional space (v_1, v_2, v_3) and exhibits variance only along the first two dimensions (fig. 1, left), with v_3 = 1 for every point¹. Let us train a neural network f_θ on this data using a loss functional L(f, D_train). Since v_3 does not contribute to any reduction in training error, standard empirical risk minimization (ERM) (Vapnik, 1999) need not differentiate between functions f that handle v_3 in different ways. Now consider an out-of-distribution (OoD) dataset with the same structure as the original dataset along v_1 and v_2, but where v_3 takes a different value, e.g. 10. In this case, one incurs an error (fig. 1, right) whenever the learnt function satisfies f(·, ·, v_3 = 1) ≠ f(·, ·, v_3 = 10). Thus, the low-rank simplicity bias, while beneficial for IID generalization (Huh et al., 2021), can potentially cause issues for OoD generalization. In cases where removing the "additional" features observed at test time improves performance (as in fig. 1), we say that the network incurs nullspace error and call the failure mode "nullspace occupancy". When diagnosing nullspace-occupancy-related failures, it is important to choose the representational basis properly. Our key technical contribution is combining notions of variation in the training data with utility for the downstream network f in order to identify the most important directions for projection. We formalize this as an optimization problem that we solve via projected gradient descent onto orthogonal matrices (Kiani et al., 2022). Experimentally, we demonstrate the existence of nullspace-occupancy-related performance degradations across different architectures and datasets evaluated on the DomainBed benchmark (table 2).
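The thought experiment above can be sketched numerically. The following is a minimal, illustrative NumPy example (not taken from the paper): two linear classifiers behave identically on training data where v_3 is constant, because ERM cannot distinguish a classifier that ignores v_3 from one whose weight on v_3 is absorbed by the bias, yet only one of them survives the shift to v_3 = 10.

```python
import numpy as np

rng = np.random.default_rng(0)
# Training data: variance along v1 and v2 only; v3 is constant (= 1).
X = np.column_stack([rng.normal(size=200), rng.normal(size=200), np.ones(200)])
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two linear classifiers with identical training behaviour:
# w_a ignores v3; w_b puts weight on v3, cancelled by its bias when v3 = 1.
w_a, b_a = np.array([1.0, 1.0, 0.0]), 0.0
w_b, b_b = np.array([1.0, 1.0, 5.0]), -5.0

def acc(w, b, X):
    """Accuracy of the linear rule sign(w . x + b) against the labels y."""
    return np.mean(((X @ w + b) > 0).astype(int) == y)

print(acc(w_a, b_a, X), acc(w_b, b_b, X))  # both perfect on training data

# OoD shift: same structure along v1, v2 but v3 = 10.
X_ood = X.copy()
X_ood[:, 2] = 10.0
print(acc(w_a, b_a, X_ood), acc(w_b, b_b, X_ood))  # w_b degrades sharply
```

Removing the constant direction v_3 (i.e., projecting it out) before classification would make the two classifiers agree again, which is exactly the intuition behind the projection method studied in this paper.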
This empirically establishes that nullspace occupancy is an issue for neural networks in OoD settings and suggests that mitigating it can yield further performance improvements. We take first steps towards showing this in practice in the leave-one-out validation setting from Gulrajani & Lopez-Paz (2020) to improve OoD performance on DomainBed. Overall, our contributions are as follows:

• We identify a nullspace occupancy based failure mode for OoD generalization.
• We demonstrate that this failure mode exists for models trained using ERM on DomainBed.
• We observe that selecting a few projection directions with high training accuracy yields the maximum potential improvements using this approach.
• Interestingly, we also find that in a data poisoning setting (Huang et al., 2020), the network exploits this nullspace occupancy phenomenon to learn a poorly generalizing classifier (section 4).
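The idea of keeping only directions with high training accuracy can be illustrated with a deliberately simplified greedy scheme. The sketch below (our own construction; the paper instead solves a joint optimization over orthonormal bases with projected gradient descent) scores each candidate direction by the training accuracy obtained after rank-1 projection onto it and keeps the top m:

```python
import numpy as np

def select_directions(Z, y, V, predict_fn, m):
    """Greedy proxy for basis selection (hypothetical simplification).

    Z: (N, K) training features; y: (N,) labels; V: (K, K) candidate
    orthonormal directions; predict_fn: maps features to predicted labels.
    Returns the m columns of V with highest rank-1 projection accuracy.
    """
    mu = Z.mean(axis=0)
    scores = []
    for k in range(V.shape[1]):
        v = V[:, k:k + 1]                 # (K, 1) candidate direction
        Z_k = (Z - mu) @ v @ v.T + mu     # rank-1 projection around the mean
        scores.append(np.mean(predict_fn(Z_k) == y))
    top = np.argsort(scores)[::-1][:m]    # highest-accuracy directions first
    return V[:, top]
```

For example, with the standard basis and a classifier that reads only the first coordinate, this procedure keeps the first coordinate direction; directions the classifier never uses score no better than a constant predictor.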

2.1. FEATURE PROJECTION MECHANICS

We work in the standard multi-class classification setting with inputs x and labels y. Let z = g(x) ∈ R^K be the features extracted at an intermediate layer l of the network by an encoder g, and let μ = (1/N) Σ_{i=1}^{N} z_i be the mean of the features z_i extracted on a training dataset with N datapoints. Let f be the classification function that maps a feature z to the logits. Finally, let V = [v_1, · · · , v_K] ∈ R^{K×K} be an orthonormal matrix, and let V_m = [v_1, · · · , v_m] ∈ R^{K×m} denote a rank-m subspace of V. Given a test datapoint x with corresponding extracted feature ẑ, one can project the datapoint onto the basis V_m as follows:

ẑ_m = V_m V_m^T (ẑ − μ) + μ    (1)

For a convolutional network, we perform the projection only along the channel dimension of the featurization. That is, for an intermediate representation with n_c channels and spatial size H × W, we consider a basis V_m of dimensionality n_c. For each spatial location (i, j) ∈ [H] × [W], we project the vector ẑ_ij as follows:

ẑ_m,ij = V_m V_m^T (ẑ_ij − μ_ij) + μ_ij    (2)

2.2. WHAT IS THE RIGHT CHOICE OF BASIS V?

We aim to find a basis V that leads to the highest decrease in the training loss L_train as we increase the rank m. This ensures that we cover the most important directions for the loss functional L(f, D_train). Intuitively, this incorporates notions of both training sensitivity and feature spread, since directions along which the training data has large spread and the function f has high sensitivity are also the ones that decrease the training loss most.
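Equations (1) and (2) are straightforward to implement; the following is a minimal NumPy sketch (function names are ours, not from the paper). Note that the projection is applied around the training mean μ, so the mean component is preserved even when it lies outside span(V_m):

```python
import numpy as np

def project(z_hat, V_m, mu):
    """Rank-m projection of features, eq. (1).

    z_hat: (..., K) features; V_m: (K, m) orthonormal columns; mu: (K,) mean.
    Keeps the component of the centered feature in span(V_m), restores mu.
    """
    return (z_hat - mu) @ V_m @ V_m.T + mu

def project_conv(z_hat, V_m, mu):
    """Channel-wise projection for convolutional features, eq. (2).

    z_hat, mu: (n_c, H, W); V_m: (n_c, m) with orthonormal columns.
    Each spatial location (i, j) is projected independently.
    """
    n_c = z_hat.shape[0]
    centered = (z_hat - mu).reshape(n_c, -1)  # (n_c, H * W) column per location
    projected = V_m @ (V_m.T @ centered)      # project every column at once
    return projected.reshape(z_hat.shape) + mu
```

Two sanity checks follow directly from the definition: the map is idempotent (projecting twice equals projecting once), and with the full basis (m = K) it reduces to the identity.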



¹ This is a special case of Huh et al. (2021), where the third eigenvalue is 0 instead of being very small.



Figure 1: Illustration of nullspace failure. Left: Training data with variation in v_1, v_2 but no variation in v_3 = 1. Black line: decision boundary learnt by a 3-hidden-layer MLP f with inputs (v_1, v_2, v_3), visualized on the plane v_3 = 1. Right: Decision boundary of the same classifier evaluated on the plane v_3 = 10. f is sensitive to v_3, causing nullspace error (red box) on test data.

Figure 2: Our methods add projection layers (orange) to an existing pretrained network with an encoder g_θ (up to layer l) and downstream classifier f.

