DON'T FORGET THE NULLSPACE! NULLSPACE OCCUPANCY AS A MECHANISM FOR OUT OF DISTRIBUTION FAILURE

Abstract

Out of distribution (OoD) generalization has received considerable interest in recent years. In this work, we identify a particular failure mode of OoD generalization for discriminative classifiers that is based on test data (from a new domain) lying in the nullspace of features learnt from source data. We demonstrate the existence of this failure mode across multiple networks trained across RotatedMNIST, PACS, TerraIncognita, DomainNet and ImageNet-R datasets. We then study different choices for characterizing the feature space and show that projecting intermediate representations onto the span of directions that obtain maximum training accuracy provides consistent improvements in OoD performance. Finally, we show that such nullspace behavior also provides an insight into neural networks trained on poisoned data. We hope our work galvanizes interest in the relationship between the nullspace occupancy failure mode and generalization.

1. INTRODUCTION

Neural networks often succeed in learning rich function approximators that generalize remarkably well to the distribution they are trained on, but are often brittle when exposed to inputs that come from a different distribution (Gulrajani & Lopez-Paz, 2020). With the rapid adoption of neural networks in safety-critical applications such as autonomous driving and healthcare, more attention is being paid to the question of robustness under domain shift (Alcorn et al., 2018; Dai & Van Gool, 2018; AlBadawy et al., 2018). Recent findings from Huh et al. (2021) hint that overparameterized deep neural networks are biased to learn functions with (approximate) low-rank covariance structure, and posit that this might be related to the phenomenon of implicit regularization (Galanti & Poggio, 2022) that has been used to explain in-distribution generalization of deep networks. How might such low-rank structure relate to out-of-distribution generalization?

As a simple thought experiment, consider a setting where training data $D_{train}$ is embedded in a three-dimensional space $(v_1, v_2, v_3)$ and exhibits variance only along the first two dimensions (fig. 1, left), with $v_3 = 1$. Let us train a neural network $f_\theta$ on this data using a loss functional $L(f, D_{train})$. Since $v_3$ does not contribute to any reduction in training error, standard empirical risk minimization (ERM) (Vapnik, 1999) training need not differentiate between functions $f$ which handle $v_3$ in different ways. Now consider an out-of-distribution (OoD) dataset which has the same structure as the original dataset along $v_1$ and $v_2$, but where $v_3$ takes a different value, e.g. 10. In this case, one would incur an error (fig. 1, right) if one learns a function $f$ where $f(\cdot, \cdot, v_3 = 1) \neq f(\cdot, \cdot, v_3 = 10)$. Thus, the low-rank simplicity bias, while beneficial for IID generalization (Huh et al., 2021), can potentially cause issues for OoD generalization.
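The thought experiment can be reproduced numerically. The following is a minimal sketch, not the setup used in the paper: it assumes a hand-picked linear classifier (the weights `w` and bias `b` are hypothetical) that fits the training data perfectly while still depending on $v_3$, and shows that projecting the shifted test points back onto the span of $(v_1, v_2)$ around the training mean removes the error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training data: variance only along v1 and v2; v3 is fixed to 1.
X_train = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Hypothetical classifier with zero training error that still depends on v3.
# The bias cancels the v3 term exactly when v3 = 1, so ERM cannot tell
# this solution apart from one that ignores v3.
w = np.array([1.0, 1.0, 0.5])
b = -0.5

def predict(X):
    return (X @ w + b > 0).astype(int)

# OoD test data: same structure along v1 and v2, but v3 = 10.
X_test = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.full(n, 10.0)])
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Project test points onto span{v1, v2} around the training mean.
mu = X_train.mean(axis=0)
V_m = np.eye(3)[:, :2]
X_proj = (X_test - mu) @ V_m @ V_m.T + mu

acc_train = (predict(X_train) == y_train).mean()
acc_ood = (predict(X_test) == y_test).mean()
acc_proj = (predict(X_proj) == y_test).mean()
print(f"train={acc_train:.2f}  ood={acc_ood:.2f}  ood+projection={acc_proj:.2f}")
```

In this toy case the nullspace direction is known in advance; the rest of the paper is concerned with finding a suitable basis when it is not.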
In cases where removing the "additional" features observed at test time improves performance (such as in fig. 1), we say that the network incurs nullspace error, and we call this failure mode "nullspace occupancy". When diagnosing nullspace-occupancy-related failure, it is important to choose the representational basis properly. Our key technical contribution is combining notions of variation in training data with utility for the downstream network $f$ in order to identify the most important directions for projection. We formalize this as an optimization problem that we solve via projected gradient descent onto orthogonal matrices (Kiani et al., 2022). Experimentally, we demonstrate the existence of nullspace-occupancy-related performance degradations across different architectures and datasets evaluated on the DomainBed benchmark (table 2). This empirically establishes that nullspace occupancy is an issue for neural networks in OoD settings and suggests that mitigating it could yield further performance improvements. We take first steps towards showing this in practice in the leave-one-out-validation setting from Gulrajani & Lopez-Paz (2020) to improve OoD performance on DomainBed. Overall, our contributions are as follows:

• We identify a nullspace occupancy based failure mode for OoD generalization.
• We demonstrate that this failure mode exists for models trained using ERM on DomainBed.
• We observe that selecting a few projection directions with high training accuracy yields the maximum potential improvements using this approach.
• Interestingly, we also find that in a data poisoning setting (Huang et al., 2020), the network exploits this nullspace occupancy phenomenon to learn a poorly generalizing classifier (section 4).

2.1. FEATURE PROJECTION MECHANICS

We work in the standard multi-class classification setting with inputs $x$ and labels $y$. Let $z = g(x) \in \mathbb{R}^K$ be the features extracted in an intermediate layer $l$ of the network using an encoder $g$, and let $\mu = \frac{1}{N} \sum_{i=1}^{N} z_i$ be the mean of the features $z_i$ extracted on a training dataset with $N$ datapoints. Let $f$ be the classification function that maps a feature $z$ to the logits. Finally, let $V = [v_1, \cdots, v_K] \in \mathbb{R}^{K \times K}$ be an orthonormal matrix, and let $V_m = [v_1, \cdots, v_m] \in \mathbb{R}^{K \times m}$ be the rank-$m$ sub-basis formed by its first $m$ columns.

Figure 2: Our methods add projection layers (orange) to an existing pretrained network with an encoder $g_\theta$ (till layer $l$) and downstream classifier $f$.

Given a test datapoint $x$ with a corresponding extracted feature $\hat{z}$, one can project the datapoint onto a basis $V_m$ of rank $m$ as follows:

$$\hat{z}^m = V_m V_m^T (\hat{z} - \mu) + \mu \tag{1}$$

For a convolutional network, we perform projection only along the channel dimension of the featurization. That is, for an intermediate representation with $n_c$ channels and spatial size $H \times W$, we consider a basis $V_m$ whose vectors have dimensionality $n_c$. For each spatial location $(i, j) \in [H] \times [W]$, we project a vector $\hat{z}_{ij}$ as follows:

$$\hat{z}^m_{ij} = V_m V_m^T (\hat{z}_{ij} - \mu_{ij}) + \mu_{ij} \tag{2}$$

2.2. WHAT IS THE RIGHT CHOICE OF BASIS V?

We aim to find a basis $V$ that leads to the highest decrease in the training loss $L_{train}$ as we increase the rank $m$. This ensures that we cover the most important directions for the loss functional $L(f, D_{train})$. Intuitively, this incorporates both notions of training sensitivity as well as feature spread, since directions with a lot of spread of training data and where the function has high sensitivity would also decrease the training loss. We propose to directly learn the projection matrix that identifies the most parsimonious $V$ based on the training set. The key idea is simple: we initialize the basis $V$ with the principal component basis computed over the training set, i.e. $V = V_{pc}$. Then, denoting by $\mu$ the mean of the features over the training dataset (as above), we solve the following optimization problem, using a projective approach for optimizing unitary matrices (Kiani et al., 2022):

$$\min_{V} \sum_{m=1}^{K} \sum_{i=1}^{|D_{train}|} L_{train}\big(f(V_m V_m^T (z_i - \mu) + \mu),\, y_i\big) \tag{3}$$

subject to the unitary constraint $V^T V = I$. Here, the training loss $L_{train}$ is cross entropy. For a convolutional network, we perform analogous projections along the channel dimension of the featurization. If we consider the training accuracy versus number of components $m$, optimizing this equation will yield the basis which maximizes area under the curve. As a result, for any given target accuracy, we find the smallest value of $m$ that achieves that level of training accuracy, thus learning the most parsimonious basis.
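The projection in Eq. (1) and the objective in Eq. (3) can be written down directly. The sketch below is illustrative only: it evaluates the objective for a given basis but does not perform the constrained optimization over $V$, and the toy classifier `f`, the weight matrix `W`, and the synthetic features `Z` are assumptions introduced for the example rather than anything from the paper.

```python
import numpy as np

def project(z, V, m, mu):
    """Eq. (1): keep the first m directions of V around the training mean mu."""
    V_m = V[:, :m]
    return (z - mu) @ V_m @ V_m.T + mu

def cross_entropy(logits, y):
    """Summed cross entropy for integer labels y."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].sum()

def basis_objective(V, Z, y, mu, f):
    """Eq. (3): training loss summed over every rank m = 1..K."""
    K = V.shape[1]
    return sum(cross_entropy(f(project(Z, V, m, mu)), y) for m in range(1, K + 1))

# Toy setup (assumed): the classifier only uses the first feature dimension.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
y = (Z[:, 0] > 0).astype(int)
mu = Z.mean(axis=0)
W = np.array([[-5.0, 5.0], [0.0, 0.0], [0.0, 0.0]])
f = lambda z: z @ W

V_first = np.eye(3)            # useful direction comes first
V_last = np.eye(3)[:, ::-1]    # useful direction comes last
loss_first = basis_objective(V_first, Z, y, mu, f)
loss_last = basis_objective(V_last, Z, y, mu, f)
print(loss_first < loss_last)
```

Both bases span the same space at full rank, yet the ordering that places the direction used by `f` first attains a lower summed loss, which is the sense in which Eq. (3) rewards parsimonious bases.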

2.3. BASELINE CHOICES OF BASES V

We consider ablations on the design of the basis $V$ that capture different intuitive notions of what one considers to be important when choosing the features to project onto (see table 1). We explain each in more detail below.

Principal Components Analysis: Given the feature matrix $Z \in \mathbb{R}^{N \times K}$ computed over the training data $D_{train}$, one can compute the principal components analysis (PCA) (Murphy, 2002) of $Z$ and use it as the basis $V$. Intuitively, this captures the directions in the feature space which have the most variance in the training data, but might not capture which directions are used by the downstream function $f$ (table 1). As we validate in section 4, considering both is important for diagnosing nullspace occupancy related failures.
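As an illustration of the PCA baseline, the basis can be obtained from the singular value decomposition of the centered feature matrix. This is a minimal sketch under the assumption that the features fit in memory as a NumPy array; the function name `pca_basis` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_basis(Z):
    """Return the feature mean, an orthonormal K x K basis whose columns are
    sorted by decreasing training variance, and the per-direction variance."""
    mu = Z.mean(axis=0)
    # Right-singular vectors of the centered features are the principal directions.
    _, s, Vt = np.linalg.svd(Z - mu, full_matrices=True)
    explained_variance = s**2 / (len(Z) - 1)
    return mu, Vt.T, explained_variance

# Synthetic features with very different spread along each axis.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
mu, V, var = pca_basis(Z)
print(np.round(var, 3))
```

The first $m$ columns of `V` then play the role of $V_m$ in Eq. (1).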

| Approach | Uses f | Uses g | Uses L_train |
|---|---|---|---|
| Random | ✗ | ✗ | ✗ |
| PCA | ✗ | ✓ | ✗ |
| Low-rank W_{l+1} | ✓ | ✗ | ✗ |
| Jacobian | ✓ | ✓ | ✗ |
| Optimized Basis | ✓ | ✓ | ✓ |

Table 1: Design space of the different bases and approaches considered in the paper. $g$ is the network encoder (till layer $l$) and $f$ is the downstream classifier as illustrated in fig. 2. $L_{train}$ is the training loss.

Function Sensitivity: The Jacobian matrix encodes the sensitivity of the classification function $f$ to varying input features $z_i$. We derive a global feature set (independent of a particular $z_i$) via the columns of the Jacobian. Intuitively, this gives us a set of directions which maximally explains the sensitivity of the function in the space $V$. Concretely, we consider the matrix $G$ of samples $\{ \partial f(z_i)_j / \partial z_i \}_{i=1}^{n}$, where each sample is the Jacobian vector of one output logit $j \in \mathcal{Y}$ (chosen by sampling from the output distribution of the classifier $f$). We do this since backpropagating through each logit (for each training datapoint) in turn is too expensive. Next, we decompose $G = U S V^T$ using singular value decomposition (SVD), and use the columns $v_i \in \mathbb{R}^K$ of the right-singular matrix $V$ as the feature set. This gives us a set of directions in the input space which explain the dominant directions in which the function varies (but does not necessarily capture directions which have variance in the training dataset $D_{train}$).

Random Basis: We start with $V \sim \mathcal{N}(0, I)$, where $I$ is the identity matrix in $\mathbb{R}^{K \times K}$, and then use Gram-Schmidt orthonormalization to obtain a full-rank random orthonormal matrix $V$, from which we take the truncated bases $V_m$.

Low Rank Linear Layer: Consider the linear embedding (or convolutional kernel) $W_{l+1}$ in the layer $l + 1$ following $z = g(x)$. We consider here low-rank approximations to this weight matrix to study if any potential OoD improvements come from a nullspace phenomenon or some other kind of capacity control. Following Kolda & Bader (2009), we consider the SVD $W_{l+1} = A S B^T$, where $S$ is a diagonal matrix of singular values $\mathrm{diag}([s_1, \cdots, s_r])$ such that $s_1 \geq s_2 \geq \cdots \geq s_r$, and $r$ is the rank. A low-rank approximation to $W_{l+1}$ can then be found by truncating $S$ to the top $m \leq r$ singular values, that is, $S_m = \mathrm{diag}([s_1, \cdots, s_m, 0, \cdots, 0])$, with the corresponding low-rank matrix $W^m_{l+1} = A S_m B^T$.
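The three baselines above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the paper's implementation: the downstream classifier `f` is a hypothetical two-layer ReLU network with random weights, its gradient is written out analytically, and QR decomposition stands in for Gram-Schmidt (the two give the same orthonormal columns up to sign).

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, C, n = 6, 16, 3, 50  # feature dim, hidden width, classes, samples (toy sizes)

# Random basis: orthonormalize a Gaussian matrix.
V_random, _ = np.linalg.qr(rng.normal(size=(K, K)))

# Hypothetical downstream classifier f: a two-layer ReLU MLP with random weights.
W1, b1 = rng.normal(size=(H, K)), rng.normal(size=H)
W2, b2 = rng.normal(size=(C, H)), rng.normal(size=C)

def f(z):
    return W2 @ np.maximum(W1 @ z + b1, 0.0) + b2

def logit_gradient(z, j):
    """Gradient of logit j with respect to the feature z (one Jacobian row)."""
    active = (W1 @ z + b1 > 0).astype(float)
    return (W2[j] * active) @ W1

def jacobian_basis(Z):
    rows = []
    for z in Z:
        logits = f(z)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        j = rng.choice(C, p=p)  # sample one logit from the classifier's output distribution
        rows.append(logit_gradient(z, j))
    G = np.stack(rows)  # shape (n, K)
    _, _, Vt = np.linalg.svd(G, full_matrices=True)
    return Vt.T  # columns are the right-singular vectors of G

def low_rank(W, m):
    """Truncated SVD: keep only the top-m singular values of W."""
    A, s, Bt = np.linalg.svd(W, full_matrices=False)
    s_m = np.where(np.arange(len(s)) < m, s, 0.0)
    return (A * s_m) @ Bt

Z = rng.normal(size=(n, K))
V_jac = jacobian_basis(Z)
W_m = low_rank(W1, m=2)
print(V_random.shape, V_jac.shape, np.linalg.matrix_rank(W_m))
```

Note that the low-rank baseline modifies the weights of the next layer and leaves the features untouched, whereas the random and Jacobian bases are used with the projection in Eq. (1).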

3. EXPERIMENTAL SETUP

Training Protocol: We evaluate on a number of domain generalization datasets provided as part of the DomainBed (Gulrajani & Lopez-Paz, 2020) benchmark. Specifically, we consider RotatedMNIST (Ghifary et al., 2015), PACS (Li et al., 2017b), TerraIncognita (Beery et al., 2018), DomainNet (Peng et al., 2019), and ImageNet-R (Hendrycks et al., 2021). We train ERMs using the hyperparameter selection strategy pioneered by Gulrajani & Lopez-Paz (2020), where we train a large number of ERMs using random hyperparameter search, pick the best one on IID validation and use it to report OOD test accuracy, repeating this process for multiple independent trials. On RotatedMNIST we make two changes to the standard DomainBed protocol: 1) we train on individual domains and evaluate on all others (to emphasize the fact that our method conceptually does not need environment annotations), and 2) we include a search over weight decay (which is fixed to 0 for RotatedMNIST in Gulrajani & Lopez-Paz (2020)), since weight decay might be related to smoothness of $f$ and nullspace occupancy. Following DomainBed, we use ResNet50 for all experiments on PACS, TerraIncognita, DomainNet, and ImageNet-R, and a small CNN for experiments on Rotated-MNIST. We note that both DomainBed and ImageNet-R are released under the MIT license. To our knowledge, none of these datasets contain personally identifiable information or offensive content.

OOD Accuracy Oracle: Given trained networks $f \circ g$ (fig. 2), our objective is to validate whether OOD accuracy improves when we project out the nullspace components for a test data point. Each of the datasets above has multiple train and test environment splits in DomainBed. For the best ERM network that DomainBed yields for a given set of training environments, we sweep through all the layers and, for each layer, we sweep through different values of $m$, perform the projection and note the peak performance improvement achieved. We then average this number across multiple choices of the training environments (each of which corresponds to a different network) to obtain the "oracle" accuracy (o) improvement. Following DomainBed methodology, we repeat the entire process for 3 independent trials, and report the mean oracle accuracy achieved. In another experiment, we sweep through all the layers, but choose the value of $m$ via a heuristic that can be computed from $D_{train}$. In such cases we call the oracle improvement the "layer-oracle" accuracy improvement (lo), since it is an oracle that has access to the ideal layer. The OOD accuracy oracle provides an upper bound to the improvement achievable by our feature projection method, and is indicative of the amount of generalization failure explained by nullspace failure.

Leave-one-out evaluation: A limitation of the oracle evaluation method is that there is no way to pick the correct layer in which to perform projection. Towards obtaining a practically achievable improvement, we perform an additional experiment where we select the layer $l$ and number of components $m$ based on highest accuracy on a held-out training domain. Our methodology for the PACS, TerraIncognita, and DomainNet datasets is as follows: given a test domain, we hold out one training domain for selecting projection hyperparameters and train a network using ERM on the remaining training domains. We fit bases on the remaining domains, select the layer $l$ and number of components $m$ to get the highest achievable improvement on the held-out domain, and evaluate performance on the test domain. We train a network on every combination of training and held-out domain, and compute the mean accuracy on the test domain over networks trained with different held-out domains. We repeat this entire process for 3 independent trials, and report mean accuracy and standard error. For Rotated-MNIST, we train networks on 0-degree data, select the layer and number of components on 45-degree data, and evaluate on the remaining domains.

basis (green) does not improve performance over the base ERM. Instead, the performance converges monotonically to that of the base ERM model when m is full-rank.

Does

In contrast to the behavior of the random basis, methods which utilize $f$, $g$ or $L_{train}$ (table 1) exhibit a different structure with respect to $m$. While they all converge to the base ERM accuracy at full rank $m$ (right end of the plot), at intermediate values of $m$ we observe that projection improves the OoD performance of the network. More specifically, with increasing number of components, the performance first increases up to an optimal choice of $m$ and then decreases as more and more spurious nullspace components are added. The best OoD performance is achieved by the optimized basis (red line, fig. 3), which explicitly finds the directions leading to the highest marginal improvement in the training loss $L_{train}$ and discards directions which do not improve $L_{train}$. This suggests that the optimized basis finds a more meaningful set of directions for a given $m$ compared to PCA (which only considers $g$ and not how $f$ uses the different directions) or the Jacobian-based feature spaces (which consider the sensitivity of $f$ to different directions but not whether that sensitivity contributes to improvement in the training loss $L_{train}$). Overall, this suggests that nullspace occupancy is a mechanism of OoD failure, and that the optimized basis is the best choice of basis $V$ with which to investigate this mechanism.

Does projection improve OoD performance?

Next we study what improvements in performance are possible with perfect heuristics or "oracle" choices of layer $l$ and rank $m$ on the DomainBed benchmark. In addition to ERM, we compare with the CORAL (Sun & Saenko, 2016), SagNet (Nam et al., 2019), and Interdomain-Mixup (Yan et al., 2020) models (from DomainBed), which achieve state-of-the-art performance on the datasets explored in our work.

Heuristic for choosing m: As explained in section 3, we report two oracles for each basis: 1) "oracle", which assumes access to both the best layer and the best number of components $m$, and 2) "layer-oracle", which selects $m$ using a heuristic and picks the best layer. Thus, "layer-oracle" is a more practical indicator of achievable OoD improvements, while "oracle" provides an upper bound on achievable performance. For the "layer-oracle" approach we choose $m$ at the point where the training accuracy saturates to 99.9% of its full-rank value. Notice that for the optimized basis, the oracle numbers (o) compare favourably to the SagNet and CORAL algorithms on PACS, TerraIncognita as well as DomainNet (suggesting possible improvements of 3.3 ± 1.0 over SagNet on TerraIncognita, for example). Interestingly, the layer oracle (lo), which uses a heuristic for choosing $m$, also gives consistent improvements over ERM (table 2). A similar table with results on in-distribution generalization performance is included in Appendix F.

One concern when reporting oracle numbers as an aggregate, maximizing over $m$ and choosing an oracle layer $L$, is whether the reported improvements are significant, or whether even a baseline model (at the same performance as the original ERM, equipped with randomness in how it labels) could achieve the same improvements given $m \times L$ trials. We compare against such baseline models (for the multi-class classification case) in Appendix C. Our results show that the systematic improvements we report in table 2 are highly unlikely to occur due to random chance (see Appendix C for details).
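One plausible reading of the saturation heuristic for $m$ is to take the smallest rank whose training accuracy reaches 99.9% of the full-rank training accuracy. The sketch below implements that reading; the function name `choose_rank` and the accuracy curve are made up for illustration, and the exact rule used in the experiments may differ.

```python
def choose_rank(train_acc_by_rank, fraction=0.999):
    """Smallest m whose training accuracy reaches `fraction` of the full-rank value.

    train_acc_by_rank[m - 1] is the training accuracy after projecting onto V_m,
    so the last entry corresponds to the unprojected (full-rank) network.
    """
    if not train_acc_by_rank:
        raise ValueError("need at least one accuracy value")
    target = fraction * train_acc_by_rank[-1]
    for m, acc in enumerate(train_acc_by_rank, start=1):
        if acc >= target:
            return m

# Made-up accuracy curve, for illustration only.
curve = [0.31, 0.62, 0.88, 0.975, 0.9795, 0.98, 0.98]
m = choose_rank(curve)
print(m)
```

Because the rule only needs training accuracies, it can be computed from $D_{train}$ alone, which is what distinguishes the layer oracle from the full oracle.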

Does nullspace occupancy yield non-oracle improvements?

We next study if it is possible to choose the layer $l$ and rank $m$ using an unseen training domain, to obtain performance improvements on a novel, held-out OoD test domain. Concretely, we train a network using the standard DomainBed protocol on, say, $d$ training domains, use an unseen training domain $d + 1$ to pick $l$ and $m$, and report performance after projection on a test domain $d + 2$. Encouragingly, the optimized basis yields consistent improvements over ERM across all three datasets we evaluated on, namely Rotated-MNIST, PACS, and TerraIncognita (table 3). Further, the optimized basis is the only method that achieves consistent improvements over ERM in this setting. This demonstrates the value of learning the optimized basis, which makes use of $f$, $g$ and $L_{train}$ (table 1), over other baseline approaches for choosing the basis. Importantly, this also indicates that reducing nullspace occupancy is a viable method for improving OoD performance of already trained ERM networks in more practical, non-oracle settings.

Is nullspace occupancy distinct from "capacity control"?

We compare against keeping the same activations $\hat{z}$ in layer $l$ at test time, but instead reducing the rank of the matrix $W_{l+1}$ in the next layer. This implements a simple, inference-time way to control the capacity of the function $f$ locally around layer $l$. Across all datasets, for both the oracle and the layer oracle (table 2), we find that this yields inferior results to both the optimized basis and the PCA basis. This result demonstrates that it is important to take both the variance in the training data and the overall behavior of $f$ into account when reducing the rank, as opposed to reducing the rank only based on local characteristics of the network in the next layer.

Is projection complementary to L2 regularization?

L2 regularization (Krogh & Hertz, 1991) is a commonly used heuristic for training deep models. Given parameters $\theta$ and a loss function $L_{train}$, L2 regularization optimizes $L_{train} + \lambda \|\theta\|_2^2$, where $\|\cdot\|_2$ denotes the $L_2$ norm. It is intuitive that L2 regularization should lead to "simpler" networks (in the linear case this corresponds to lower-rank solutions) which may not suffer from the nullspace failure mode. While the results in table 2 already include models trained with weight decay (based on standard DomainBed hyperparameter sweeps), in this section we further scrutinize the connection between nullspace occupancy failure and L2 regularization, since it is possible that in the presence of such regularization one might extrapolate more smoothly in a low-rank nullspace (see appendix E). Instead of the standard DomainBed procedure, where for each set of train and test environments (termed a condition) one trains a large number of ERM models via random hyperparameter selection (Gulrajani & Lopez-Paz, 2020) and picks the best model, we bin the networks into sets based on the strength $\lambda$ of the L2 penalty used to train them (fig. 7) and then pick the best ERM model in each bin (using validation data from the training domain). This allows us to assess the complementary benefits of our approach (which completely projects down to a chosen subspace) over L2 regularization, which should in principle ensure smoother functions. We plot the performance of the best model in each L2 penalty bin and compare it to the oracle improvements achievable over that specific model using projection onto the optimized basis (fig. 7) on RotatedMNIST (see Appendix B for more datasets). We find that projection gives consistent improvements over L2 regularization across the different buckets (x-axis of fig. 7, right). This suggests that L2 regularization and projection are complementary to each other, and that nullspace failure occurs in deep networks even in the presence of L2 regularization.

Does nullspace occupancy emerge in data poisoning?
So far we studied a setup where we intervened by projecting the representation down to a subspace $V_m$, and we observed that this can improve the performance of the network $f$ operating on the projected features $v_x^m$. What if we instead did the reverse? That is, does the nullspace occupancy phenomenon emerge in cases where we explicitly force neural networks to not generalize? To test this, we utilize the experimental setting of Huang et al. (2020), who show how to take training and test datasets $D_{train} = \{x_i, y_i\}_{i=1}^{|D_{train}|}$ and $D_{test} = \{x_i, y_i\}_{i=1}^{|D_{test}|}$ and find a network $u_\theta$ that generalizes poorly to $D_{test}$. Let $u_\theta : \mathcal{X} \to P(\mathcal{Y})$ be the neural network mapping inputs $\mathcal{X}$ to a distribution over the labels $P(\mathcal{Y})$. Then we train to optimize:

$$\min_{\theta} \; \frac{1}{|D_{train}|} \sum_{i=1}^{|D_{train}|} -\log u_\theta(y_i \mid x_i) \; + \; \frac{1}{|D_{test}|} \sum_{i=1}^{|D_{test}|} -\log\big(1 - u_\theta(y_i \mid x_i)\big)$$

We perform our experiment on the MNIST dataset by splitting the dataset into a train and test split. We essentially train to perform well on the training set and badly on the test set, yielding a small convolutional network $u_\theta$ which by construction performs poorly on the given test set $D_{test}$. Having "poisoned" the test set, we check if projecting the test datapoints $x \in D_{test}$ to a subspace $V_m$ can "de-poison" the network. Fig. 4 shows that this indeed happens in layer 5 for this network, and that one can recover a large fraction of the performance lost to poisoning by projecting onto the first $m$ components of the PCA basis that explain most of the variance. This indicates that the phenomenon of nullspace occupancy contributing to generalization failure has applicability in other settings which are not necessarily connected to out-of-distribution generalization, and hints at some interesting underlying mechanism that connects nullspace occupancy to generalization more broadly. See Appendix G for additional experimental details.

Published as a conference paper at ICLR 2023
Finally, we performed experiments on adversarial robustness (see appendix K), where we found similar trends as in data poisoning: both PCA and the optimized basis lead to gains in robustness, but the gap between PCA and the optimized basis is not as large as observed in more traditional OoD generalization tasks (such as on DomainBed).
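For concreteness, the poisoning objective from the data poisoning experiment (fit the training labels while pushing probability mass away from the true test labels) can be evaluated as follows. This is a minimal sketch that takes predicted class probabilities as input; the function name `poisoning_loss` and the small constant `eps` (added here for numerical safety) are assumptions that do not appear in the source.

```python
import numpy as np

def poisoning_loss(p_train, y_train, p_test, y_test, eps=1e-12):
    """p_* are (N, C) arrays of predicted class probabilities u_theta(. | x)."""
    p_true_train = p_train[np.arange(len(y_train)), y_train]
    p_true_test = p_test[np.arange(len(y_test)), y_test]
    fit_train = -np.log(p_true_train + eps).mean()         # do well on D_train
    unfit_test = -np.log(1.0 - p_true_test + eps).mean()   # do badly on D_test
    return fit_train + unfit_test

p_train = np.array([[0.9, 0.1], [0.2, 0.8]])
y_train = np.array([0, 1])
y_test = np.array([0])

# A network that is wrong on the test point has a lower loss than one that is right.
loss_poisoned = poisoning_loss(p_train, y_train, np.array([[0.1, 0.9]]), y_test)
loss_generalizing = poisoning_loss(p_train, y_train, np.array([[0.9, 0.1]]), y_test)
print(loss_poisoned < loss_generalizing)
```

Minimizing this quantity over the network parameters yields a model that is accurate on the training split and inaccurate on the chosen test split by construction.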

5. RELATED WORK

Out-of-Distribution (OoD) Generalization. There has been considerable interest in the out-of-distribution generalization problem (Gulrajani & Lopez-Paz, 2020; Arjovsky et al., 2019; Sagawa et al., 2019; Li et al., 2017a), where the goal is to generalize the classifier $f \circ g$ to novel input domains $D$. A number of popular approaches have been pursued, including causal invariance (Arjovsky et al., 2019), gradient starvation (Shah et al., 2020; Pezeshki et al., 2020), risk extrapolation (Krueger et al., 2020), meta-learning (Li & Li, 2017), and distributionally robust optimization (Sagawa et al., 2019; Sinha et al., 2017). Among these, causal invariance and gradient starvation (Pezeshki et al., 2020) (or simplicity bias (Shah et al., 2020)) are quite relevant to us, since they describe certain failure modes which might be causing poor out-of-distribution accuracy. In contrast to work on causal invariance, our work does not require an underlying "true" causal model or annotation of different environments (except for the purposes of leave-one-out domain evaluation). While gradient starvation is concerned with features $v_1$ and $v_2$ where both are potentially useful, our work is focused on removal of additional/extra features, rather than using all the relevant features. In this sense, gradient starvation and our work are quite complementary. Another relevant approach is that of last-layer feature reweighting. Rosenfeld et al. (2022) and Kirichenko et al. (2022) show that ERM already learns some generalizable features by showing that retraining the last layer using access to the target distribution or a non-spuriously correlated reference distribution improves OOD performance. In contrast, we show that it is possible to drop extra nullspace features without access to a target or reference distribution to improve network performance.

Nullspaces, PCA, and Representation Learning. Low-rank feature spaces (primarily using PCA) have been utilized for detecting adversarial examples (Li & Li, 2017; Carlini & Wagner, 2017) and for improving adversarial robustness (Sanyal et al., 2018). In contrast to these works, we are focused on "semantic" OoD generalization as opposed to adversarial examples. Other recent work has noted that neural networks learn feature spaces which are essentially low rank (Huh et al., 2021). Our work makes use of that observation to study OoD generalization and how choices of bases affect generalization. In contrast to these works, which largely utilize PCA, our work also considers the downstream network $f$ to identify the directions that are most predictive of the label on the training set. In this sense, our work is like a deep version of partial least squares (PLS) (Geladi & Kowalski, 1986), which aims to find the discriminative directions in feature space for linear models.

Low-rank models and Pruning. There is a lot of work in model compression that focuses on reducing the rank of the layer weights $W_l$ after training (Cichocki et al., 2016; Udell et al., 2016). A naive data-independent strategy is to minimize the distance $\|W - U V^T\|$ between the original network layer weight $W$ and the compressed weights $U V^T$, using a closed-form solution such as SVD (Denton et al., 2014; Novikov et al., 2015). We utilize these techniques in our low-rank $W_{l+1}$ baseline (section 2) to compare against our feature projection techniques (table 2). In contrast to these works, which focus on compression, our goal is instead to improve downstream OoD performance (our low-rank basis does not yield any model compression). Zhang et al. (2021) relate model pruning to out-of-distribution generalization, showing that even in models which rely on spurious correlations, there exist subnetworks that are robust to spurious correlations. Similar to this work, we reduce the capacity of the network (or a particular layer in our case) to study OoD generalization. However, unlike network pruning, our technique focuses on a particular layer and exposes nullspace occupancy as a mechanism by which OoD failure can happen.

6. DISCUSSION

Nullspace removal is not sufficient for improving generalization. We observe that while there usually exists a layer $l \in \{1, \cdots, L\}$ where performing the projection improves the oracle out-of-distribution accuracy, the projection does not necessarily help in every layer. To understand this, it is helpful to think of a breakdown of the error $\epsilon$ into two different sources, namely $\epsilon_{null}$, which is the nullspace error, and $\epsilon_{ds}$, which is the distribution shift error. One can have a distribution shift even if the test data lies in the span of the top components $V_m$ which explain the training data very well, since even if the components are the same, the distribution of the test datapoints could end up being quite different. Thus, removing the nullspace error $\epsilon_{null}$ does not necessarily lead to a low overall error $\epsilon$. However, in practice it seems to play a role in OoD generalization.

Nullspace occupancy is not necessary for poor generalization. It is also straightforward to notice that if a trained network learns appropriately smooth functions $f(v_1, v_2)$, ignoring an extra feature $v_3 = 0$ for example, then it would not be troubled by test data occupying the nullspace. In practice, it appears that networks learnt even in the presence of weight decay do not show such behavior, which suggests that nullspace occupancy might be a mechanism for explaining generalization failure in those cases.

Connection to other OoD generalization failure modes. Given input features $\{v_1, v_2\} \in V$ present in a dataset $D_{train}$, and a trained neural network function $f(v_1, v_2)$ from $V \to \mathcal{Y}$, where $\mathcal{Y}$ is the label space (fig. 1), recent work has studied two popular modes of out-of-distribution generalization failure with deep learning:

• Causality: Given a label $y$, a causal feature $v_1$ and a "spurious" correlated feature $v_2$, the idea is to learn a function $f(v_1)$ that ignores $v_2$ (Arjovsky et al., 2019). If we instead learn a function $f(v_2)$, we fail due to spurious correlation.
• Gradient Starvation (Pezeshki et al., 2020; Shah et al., 2020): Given two predictive features $v_1$ and $v_2$, where $v_1$ elicits a stronger response at initialization than $v_2$, the dynamics of learning yields $f(v_1)$ that ignores the weaker, but predictive feature $v_2$. This can lead to generalization failure in a novel environment.

Our approach is distinct from causality in the sense that we are simply measuring geometric properties of the learnt features as opposed to assuming any underlying causal model that induces OoD failure. More closely related to our work is gradient starvation. In the language of gradient starvation, let us assume a stronger feature $v_1$, a less dominant feature $v_2$ and an invariant feature $v_3 = 0$. Gradient starvation considers the case of $v_1$ and $v_2$, and suggests that the strong feature $v_1$ might inhibit the learning of the weak feature $v_2$, at least in the neural tangent kernel (NTK) regime (Jacot et al., 2018). A priori, one might expect that $v_1$ would suppress the learning of $v_3$ as well, but our work shows empirically that this is not the case. Our initial attempts to extend the theory of Pezeshki et al. (2020) revealed that their framework cannot easily accommodate a notion of a "useless" feature, and the linearity assumptions in the NTK regime might not explain the non-smooth behavior of deep neural networks (see Appendix D). However, it is interesting to note that while the gradient starvation work attempts to mitigate this and learn more features, our work in contrast attempts to discard uninformative and redundant features.

7. CONCLUSION

In this work we introduced the concept of nullspace occupancy, namely when test data occupies the nullspace of a low-rank feature space $V_m$ that captures the training set variability, and connected it to OoD generalization. We showed that with a careful choice of the basis $V_m$, systematic improvements in OoD generalization can often be obtained by projecting out the remaining nullspace components for test data points. We found this "optimized basis" by performing an optimization over orthogonal matrices to identify a set of components that minimizes the area under the training loss curve for different choices of rank $m$. On domain generalization experiments on RotatedMNIST, PACS and TerraIncognita we found that this choice of basis has the potential to improve out-of-distribution accuracy of ERMs. Finally, we also found that nullspace occupancy emerges in a setting where one poisons the network to perform poorly on a predetermined test set. Together, these results hint at a broader interplay between nullspace occupancy and generalization in deep learning. There is much to explore to understand the ubiquity of this novel mode of generalization failure and further refine our approach to mitigate it, and we hope the community works to address these issues in the future.

88.5 ± 1.0 | 82.9 ± 1.7 | 96.2 ± 0.5 | 80.1 ± 1.3 | 86.9
Low Rank $W_{l+1}$ (lo) | 87.9 ± 1.2 | 81.4 ± 1.7 | 96.1 ± 0.5 | 78.3 ± 1.7 | 85.9
Optimized (lo) | 88.0 ± 0.9 | 82.9 ± 1.9 | 96.5 ± 0.4 | 82.0 ± 0.5 | 87.4

B ADDITIONAL PLOTS

In this section, we present figures from the main paper evaluated on additional datasets. In Figure 3 of the main paper, we show that nullspace occupancy relates to OOD failure by plotting the improvement in OOD classification accuracy against the dimensionality of the subspace used for projection, for many different bases, on Rotated-MNIST, PACS, and TerraIncognita. Fig. 5 shows the corresponding plot for one randomly selected network from DomainNet. In Figure 4 (left) of the main paper, we verify that the effect of projection is complementary to that of weight decay by plotting the improvements obtained by projection along with the base network accuracy for networks trained with different values of weight decay on Rotated-MNIST. Fig. 6 presents the corresponding result on PACS, where we again see that projection gives consistent improvements across different buckets of weight decay, suggesting that weight decay and projection are complementary to each other.

C COMPARISON WITH MAXIMUM PERFORMANCE OF A LARGE NUMBER OF BASELINE MODELS EQUIPPED WITH RANDOMNESS

60.3 ± 0.9 | 20.5 ± 0.4 | 46.3 ± 0.9 | 13.0 ± 0.1 | 59.9 ± 0.6 | 49.8 ± 0.5 | 41.6
PCA (o) | 61.0 ± 0.9 | 21.0 ± 0.4 | 46.4 ± 0.9 | 13.8 ± 0.1 | 60.3 ± 0.6 | 50.8 ± 0.5 | 42.2
Low Rank $W_{l+1}$ (o) | 59.9 ± 0.8 | 20.5 ± 0.4 | 45.8 ± 0.9 | 13.0 ± 0.2 | 59.8 ± 0.5 | 50.4 ± 0.7 | 41.6
Optimized (o) | 61.1 ± 0.9 | 21.0 ± 0.4 | 47.0 ± 0.9 | 13.5 ± 0.1 | 61.1 ± 0.7 | 51.0 ± 0.8 | 42.4
Random (lo) | 59.4 ± 0.7 | 19.7 ± 0.2 | 45.5 ± 0.5 | 12.0 ± 0.2 | 58.3 ± 0.2 | 49.2 ± 0.7 | 40.7
Jacobian (lo) | 59.7 ± 0.9 | 20.0 ± 0.4 | 45.5 ± 0.9 | 12.5 ± 0.2 | 59.4 ± 0.6 | 49.3 ± 0.5 | 41.0
PCA (lo) | 60.1 ± 1.0 | 20.2 ± 0.3 | 45.7 ± 1.0 | 12.7 ± 0.3 | 59.7 ± 0.6 | 49.9 ± 0.5 | 41.4
Low Rank $W_{l+1}$ (lo) | 59.4 ± 0.8 | 20.1 ± 0.3 | 45.3 ± 0.9 | 12.5 ± 0.2 | 59.4 ± 0.5 | 49.6 ± 0.5 | 41.0
Optimized (lo) | 60.7 ± 0.9 | 21.0 ± 0.4 | 45.8 ± 1.1 | 12.9 ± 0.

In the main text, we report oracle numbers that are obtained by maximizing over many choices $m$ of subspace dimensionality and over the layers $L$. These numbers are the maximum over many different trials, raising the concern that they may not be significant, because even a baseline model with the same accuracy as ERM, equipped with randomness in labelling, could achieve the same improvements given $m \times L$ trials. In this section, we construct one reasonable choice of such a model, and compare the improvements obtained via our oracle method with an upper bound on the improvement obtainable via the random baseline model. Let the base (ERM) model be $M$ with classification accuracy $x$ on the OOD test set. Let the number of samples in the test set be $n$, the number of layers $L$, and the number of considered choices of feature space dimensionality be $m$. In this setting, the effective number of experiments is $d = L \times m$. We therefore consider a set of non-deterministic models $W_1, \dots, W_d$, where each model $W_i$ predicts the label correctly for each test sample with probability $x$. Let the out-of-distribution accuracies of models $W_1, \dots, W_d$ be random variables $x_1, \dots, x_d$. It follows that $\mathbb{E}[x_i] = x$ and $\mathbb{V}[x_i] = \frac{x(1-x)}{n}$.
We wish to compare the improvements obtained by our method with $\mathbb{E}\left[\max_{i \in [d]} x_i\right]$. To obtain an upper bound for this expectation, we assume that the variables $x_i \sim \mathcal{N}\left(x, \frac{x(1-x)}{n}\right)$. We then use the following result:

Theorem 1 (Orabona & Pál, 2015). Let $X_1, \dots, X_d$ be independent Gaussian random variables $\mathcal{N}(0, \sigma^2)$. For any $d \geq 2$,

$$\sigma \sqrt{2 \log d} \;\geq\; \mathbb{E}\left[\max_{i \in [d]} X_i\right] \;\geq\; \sigma \left(1 - \exp\left(-\frac{\sqrt{\ln d}}{6.35}\right)\right) \left(\sqrt{2 \ln d - 2 \ln \ln d} + \sqrt{\frac{2}{\pi}}\right) - \sqrt{\frac{2}{\pi}}\, \sigma$$

This gives us the upper bound:

$$\mathbb{E}\left[\max_{i \in [d]} x_i\right] \leq x + \sigma \sqrt{2 \log d} = x + \sqrt{\frac{x(1-x)}{n}} \sqrt{2 \log(mL)}$$

We compare this upper bound with the oracle performance obtained via projection on the ImageNet-R dataset, which offers a tight upper bound due to the availability of a large number of test samples. In our experiments on ImageNet-R, we consider projections in $L = 5$ layers, and $m = 32$ choices of dimensionality of the feature subspace. The base accuracy of ERM is $x = 36.1\%$, yielding an upper bound of $\mathbb{E}\left[\max_{i \in [d]} x_i\right] - x \leq 0.88\%$. In contrast, the oracle accuracies of our projection method yield performance improvements of 1.3% when using the PC basis, and 2.3% when using the optimized basis. These results suggest that our method provides a systematic improvement to generalization performance, and is not an artifact of taking a maximum over a large number of models. Note that the baseline model we consider, constructed by using the maximum performance of a set of models that achieve the same accuracy as the base model, is a very strong baseline. For reference, if we consider a naive model equipped with randomness that starts with the base ERM and randomly flips the prediction with probability $p$, it would achieve accuracy $p \cdot x + (1-p)\,\frac{1-p}{\#\text{classes}}$.
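The upper bound on the improvement of the random baseline can be reproduced with a few lines of arithmetic. The sketch below assumes a test set of $n = 30{,}000$ images, which is the size of ImageNet-R but is not stated in this section; the function name is hypothetical.

```python
import math

def random_baseline_gain_bound(x, n, num_layers, num_dims):
    """Upper bound on E[max_i x_i] - x for d = num_layers * num_dims baselines,
    using sigma * sqrt(2 * log(d)) with sigma^2 = x * (1 - x) / n."""
    d = num_layers * num_dims
    sigma = math.sqrt(x * (1.0 - x) / n)
    return sigma * math.sqrt(2.0 * math.log(d))

# Values from the text: x = 36.1%, L = 5 layers, m = 32 subspace dimensions.
# n = 30_000 is an assumption (the number of images in ImageNet-R).
bound = random_baseline_gain_bound(x=0.361, n=30_000, num_layers=5, num_dims=32)
print(f"{100 * bound:.2f}%")  # prints 0.88%
```

Under these assumptions the bound matches the 0.88% quoted above, which is below the 1.3% and 2.3% oracle improvements obtained by projection.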
The accuracy of such a naive model could be significantly lower than $x$, and the maximum over a large number of such models could still yield performance less than $x$.

D CONNECTION TO GRADIENT STARVATION (PEZESHKI ET AL., 2020)

Consider our running example of a training set with 3 features $\{v_1, v_2, v_3 = 1\}$, where feature $v_3$ is a nullspace feature and has a constant value of 1. Gradient Starvation (GS) considers the setting where $v_1$ is a "stronger" feature than $v_2$, and suggests that the strong feature $v_1$ inhibits the learning of the weak feature $v_2$. In the context of our work, one might expect that $v_1$ would also inhibit the learning of $v_3$, thus potentially mitigating generalization failure due to nullspace occupancy. However, our empirical results indicate that this is not the case. To reconcile this discrepancy, we attempt to fit nullspace features within the theoretical framework of Gradient Starvation.

GS considers a setting with $n$ data points $X = [x_1, \dots, x_n] \in \mathbb{R}^{n \times m}$, labels $y \in \{-1, 1\}^n$ with $Y = \mathrm{diag}(y) \in \mathbb{R}^{n \times n}$, and function output $\hat{y}(X) = f(X)$ for a neural network $f$. GS looks at the NTK parameterization, in which the network output is approximated as a linear function of the parameters near the initial parameters $\theta_0$:

$$\hat{y}(X, \theta) = \hat{y}(X, \theta_0) + \Phi_0 \theta \qquad (5)$$

where $\Phi_0 = \Phi(X, \theta_0)$ is the neural tangent random feature (NTRF) matrix at initialization, i.e. $\Phi(X, \theta) = \frac{\partial \hat{y}(X, \theta)}{\partial \theta} \in \mathbb{R}^{n \times m}$. GS then seeks to characterize the response of the network to a training example, where the response is defined as the deviation from its initial value: $r = Y(\hat{y} - \hat{y}_0) = Y \Phi_0 \theta$. The theory studies features defined in terms of the singular value decomposition of the NTRF, $Y \Phi_0 = U S V^T$, where $(V^T)_{j \cdot}$ is the $j$th feature, $(S)_{jj}$ is the singular value of that feature, and $(U)_{\cdot j}$ are the weights of the feature in each example, adjusted according to the labels $y$. The network's response to each feature $j$ can then be expressed as $z = U^T r = S V^T \theta$.
To evaluate the predictions of the Gradient Starvation framework for nullspace features, our goal is to look at the response to nullspace features when they appear in a new out-of-distribution example $x_o$. Note first that the features considered under the framework are obtained via singular value decomposition of the NTRF, and the diagonal matrix of singular values $S$ has rank $n$, i.e. all the singular values are positive. Hence, the Gradient Starvation framework does not include any features with singular value zero (nullspace features), and can only explain suppression of features in the training subspace. To attempt to extend the framework to include nullspace features, we extend the feature set obtained via SVD of the NTRF by extending the set of examples to $[x_1, \dots, x_n, x_o] \in \mathbb{R}^{(n+1) \times m}$, which now also includes the sample $x_o$, an out-of-distribution example that contains variation along some nullspace feature $v_o$. Looking at the response $z = S V^T \theta$, we note that $S$ would now have an additional element, leading to a nonzero singular value along $v_o$. However, the evaluation of training dynamics in GS optimizes $\theta$ in the assumed linear approximation from equation 5 by minimizing a ridge-regularized cross-entropy loss, and any amount of ridge regularization would drive the component of $\theta$ along $v_o$ down to exactly zero. As a result, the response $z = S V^T \theta$ along nullspace feature $j$ would have nonzero corresponding values in $S$ and $V^T$, but a coefficient of 0 in $\theta$. Therefore, this extension of the GS framework would predict a network response of zero along any nullspace component, at odds with the empirical results we find in this work.
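The effect of ridge regularization on a direction that never varies in training can be checked on a toy problem. The following is a minimal sketch rather than the GS analysis itself: it assumes a two-parameter linear model trained by gradient descent on a ridge-regularized logistic loss, with a second feature that is identically zero on the training set standing in for $v_o$. The data, step size, and regularization strength are illustrative choices.

```python
import math

def train_ridge_logistic(features, labels, theta, lam=0.1, lr=0.5, steps=2000):
    """Gradient descent on mean logistic loss + (lam / 2) * ||theta||^2."""
    n = len(features)
    theta = list(theta)
    for _ in range(steps):
        grad = [lam * t for t in theta]  # gradient of the ridge term
        for phi, y in zip(features, labels):
            margin = y * sum(p * t for p, t in zip(phi, theta))
            # d/dtheta log(1 + exp(-margin)) = -y * phi / (1 + exp(margin))
            coeff = -y / (1.0 + math.exp(margin)) / n
            for j, p in enumerate(phi):
                grad[j] += coeff * p
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# The second feature is always zero in training: it plays the role of v_o.
features = [(1.0, 0.0), (2.0, 0.0), (-1.0, 0.0), (-1.5, 0.0)]
labels = [1, 1, -1, -1]

# Start with a nonzero coefficient along the unused direction.
theta = train_ridge_logistic(features, labels, theta=[0.0, 1.0])
```

The coefficient along the unused direction receives no gradient from the data term, so the ridge term shrinks it geometrically towards zero, while the coefficient on the predictive feature settles at a positive value. A linear model of this kind therefore cannot respond to a test point that varies only along $v_o$.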

E WEIGHT DECAY AND PROJECTION: AN ANALYSIS

Let us consider weight decay regularization and how it relates to the singular values of the weight matrices $W_l \in \mathbb{R}^{m \times K}$ learnt in different layers $l$. Specifically, weight decay penalizes the squared Frobenius norm $\|W\|_F^2$, which can also be written as $\mathrm{tr}(W W^T)$. Consider the singular value decomposition (SVD) $W = U \Sigma V^T$, let $r$ be the rank of $W$, and let $u_i$ refer to the $i$th column of $U$. We can then write:

$$\begin{aligned}
\mathrm{tr}(W W^T) &= \mathrm{tr}(U \Sigma V^T V \Sigma U^T) && (9) \\
&= \mathrm{tr}(U \Sigma^2 U^T) && (10) \\
&= \mathrm{tr}\left(\sum_{i=1}^{r} \sigma_i^2\, u_i u_i^T\right) && (11) \\
&= \sum_{i=1}^{r} \sigma_i^2\, \mathrm{tr}(u_i u_i^T) && (12) \\
&= \sum_{i=1}^{r} \sigma_i^2 && (13)
\end{aligned}$$

where we first used the fact that $V^T V = I$, then the fact that the trace is a linear operation, and finally the fact that $\|u_i\| = 1$. Thus, adding weight decay to a network yields small singular values (in an L-2 sense). This, however, does not yield sparsity. Thus, in some sense, "full-rank" information can potentially pass through multiple layers of a network with weight decay; in contrast, this would not be possible if the rank $r$ were really small (or $\sigma_i = 0$). This explains the complementary benefits of performing projection even in the presence of weight decay (as shown in the main paper).
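The identity in equations (9) to (13) is easy to verify numerically. The sketch below uses a random matrix as a stand-in for a weight matrix, and a uniform rescaling as a crude stand-in for the shrinkage induced by weight decay; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # stand-in for a weight matrix W_l

# Three ways of computing the quantity penalized by weight decay.
frobenius_sq = np.sum(W ** 2)
trace_form = np.trace(W @ W.T)
singular_values = np.linalg.svd(W, compute_uv=False)
sum_sq_singular = np.sum(singular_values ** 2)

# Uniformly shrinking W (a simplification of weight decay) scales every
# singular value but sets none of them to zero, so the rank is unchanged.
shrunk = 0.1 * W
rank_before = np.linalg.matrix_rank(W)
rank_after = np.linalg.matrix_rank(shrunk)
```

All three quantities agree, and the rank is the same before and after shrinking, which illustrates why a small Frobenius norm alone does not remove directions from the representation.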

F IN-DISTRIBUTION GENERALIZATION RESULTS

In this section, we report the performance of layer-oracle and oracle projection methods on in-distribution generalization in Table 8. Interestingly, we find that the optimized basis consistently outperforms the other choices of bases. For example, on PACS, the optimized basis yields improvements of 0.9% as opposed to 0.3% for, say, the PCA basis. Comparing the IID to the OOD setting, we find that the PCA basis yields much smaller improvements compared to the optimized basis. The ratio of the oracle improvements from the PCA basis to those from the optimized basis (averaged across all datasets) is 0.42 for IID compared to 0.79 for OOD. This suggests that, consistent with our intuition, the OOD datasets have more features occupying the nullspace in the PC basis, and thus the PC basis performs better in OOD and worse in IID. However, considering the impact of the downstream function in addition to the variance (which our optimized basis does) can improve performance in the IID setting as well.

G DATA-POISONING EXPERIMENT DETAILS

We perform our data poisoning experiments on the MNIST dataset. We split the MNIST dataset into two halves: a clean training set and a poisoned test set. We train a network with 4 convolutional layers, an average pooling layer, and 3 linear layers to perform well on the training set and poorly on the test set. All convolutional layers have a kernel size of 3 × 3, a padding of 1, and 64, 128, 128, and 128 channels respectively. The second convolutional layer uses a stride of 2, while all others use a stride of 1. Each convolution is followed by a ReLU nonlinearity, and a GroupNorm operation with 8 groups. After the convolutional layers, an average pooling layer with a kernel size of 1 × 1 and a stride of 1 is applied such that only a single value remains per channel. The resulting representation is fed to 3 linear layers of sizes 128 × 64, 64 × 32, and 32 × 10 respectively. Each linear layer save for the last is followed by a ReLU nonlinearity, and a batch normalization layer. To train the network, we use the Adam optimizer with β 1 = 0.9, and β 2 = 0.999. We train the network for 100 epochs with a batch size of 64 and learning rate of 1e-3.
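The layer sizes above can be sanity-checked with the standard convolution output-size formula. The sketch below is illustrative only: it assumes 28 × 28 MNIST inputs and that the pooling step averages over the whole remaining spatial map, which is what leaves a single value per channel; the helper name is hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (output channels, stride) for the four convolutional layers in the text.
conv_layers = [(64, 1), (128, 2), (128, 1), (128, 1)]

size = 28  # MNIST images are 28 x 28 pixels
shapes = []
for channels, stride in conv_layers:
    size = conv_output_size(size, stride=stride)
    shapes.append((channels, size, size))

# Averaging each channel over its spatial map leaves one value per channel
# (an assumption about how the pooling layer is applied).
pooled_features = shapes[-1][0]

# (input, output) sizes of the three linear layers in the text.
linear_layers = [(128, 64), (64, 32), (32, 10)]
chain_is_consistent = pooled_features == linear_layers[0][0] and all(
    linear_layers[i][1] == linear_layers[i + 1][0]
    for i in range(len(linear_layers) - 1)
)
```

With these assumptions the feature maps are 28 × 28 after the first convolution and 14 × 14 after the strided second one, and the 128 pooled values match the input size of the first linear layer.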

H HOW DO IMPROVEMENTS IN ACCURACY RELATE TO THE DISTANCE FROM THE SUBSPACE?

Do the observed improvements occur because we project to the subspace, or is some other perturbation of the test data point $x$ equally beneficial? To answer this, we consider an experiment where we only partially project out the component of the features lying in the complementary subspace $V'_m$. Formally speaking, each feature vector $v^x$ can be written as a sum of two feature vectors $v^x = v^x_1 + v^x_2$ such that $v^x_1$ lies in subspace $V_m$ and $v^x_2$ lies in subspace $V'_m$. For $\lambda \in [0, 1]$, we consider the partial projection to $V_m$ as $v^x_m = v^x_1 + \lambda v^x_2$. If $\lambda = 1$, we recover the original feature vector (no projection), whereas $\lambda = 0$ yields the full projection to $V_m$ (the methods discussed above). Fig. 7 shows how the fraction of OOD improvement varies with $\lambda$, for the 12 best ERMs on PACS and TerraIncognita. Overall, we find that as we retain more of the component of the feature vector in $V'_m$ (larger $\lambda$), and thus increase the distance from the subspace $V_m$, performance gradually decreases (roughly monotonically) for both the optimized basis and the PCA basis. This suggests that the improvements are likely due to the projection of the datapoints to the subspace $V_m$ as opposed to some other kind of structured perturbation.

Figure 7: Fraction improvement vs distance from optimal subspace: For the optimal subspace $V_m$ (i.e. optimal layer and $m$), we project out a $1 - \lambda$ fraction of the component orthogonal to $V_m$. For the resulting feature vector, we plot the accuracy improvement divided by the optimal subspace accuracy improvement (y-axis) vs the fraction of distance from the optimal subspace, $\lambda$ (x-axis). Solid lines show the mean fraction improvement over 12 DomainBed ERMs on PACS (left) and TerraIncognita (right), and shaded portions indicate the standard deviation.
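The partial projection $v^x_m = v^x_1 + \lambda v^x_2$ can be sketched as follows. This is a minimal illustration on random vectors: it assumes an orthonormal basis for $V_m$ and a subspace through the origin (any mean offset is omitted), and the function names are hypothetical.

```python
import numpy as np

def partial_projection(v, basis, lam):
    """Return v1 + lam * v2, where v1 is the component of v in span(basis)
    and v2 = v - v1 is the remainder. `basis` has orthonormal columns."""
    v1 = basis @ (basis.T @ v)
    v2 = v - v1
    return v1 + lam * v2

def distance_to_subspace(u, basis):
    """Euclidean distance from u to span(basis)."""
    return np.linalg.norm(u - basis @ (basis.T @ u))

rng = np.random.default_rng(0)
# Orthonormal basis of a 2-dimensional subspace of R^5 (illustrative sizes).
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))
v = rng.normal(size=5)

full = partial_projection(v, basis, lam=0.0)  # full projection onto V_m
half = partial_projection(v, basis, lam=0.5)  # halfway between
none = partial_projection(v, basis, lam=1.0)  # original feature vector
```

Here `none` equals the original vector, `full` lies in the subspace, and the distance of `half` from the subspace is exactly half that of the original, so $\lambda$ can be read directly as the fraction of the original distance from $V_m$.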

I ARCHITECTURE FOR ROTATED-MNIST EXPERIMENTS

Following DomainBed, we use the ResNet-50 architecture for all experiments on PACS, TerraIncognita, and DomainNet, and a small CNN for experiments on Rotated-MNIST. The architecture for the small CNN consists of 4 convolutional layers, succeeded by global average pooling and one linear layer. The convolutional layers are of kernel size 3x3 and "same" padding, with 64, 128, 128, and 128 channels each. Each convolution layer is followed by a ReLU activation and a GroupNorm layer with 8 groups.

J ADDITIONAL PLOTS

In this section, we include plots of out-of-distribution accuracies after projection into all the different bases, for all networks, across Rotated-MNIST (fig. 8), PACS (fig. 9), TerraIncognita (fig. 10), and DomainNet (fig. 11). Following DomainBed methodology, we have three selected networks for each OOD domain (one from each independent trial), and plot the out-of-distribution accuracies after projection into different bases in the oracle layer of each network. The x-axis is the rank of the subspace used for projection, while the y-axis is the improvement in OOD classification accuracy relative to the base network. Each row of plots corresponds to a different out-of-distribution test domain.

evaluate oracle performance after projection on adversarial examples generated using AutoAttack (Croce & Hein, 2020), for a WideResNet-28-10 network trained on CIFAR-10. We consider attacks with a maximum $L_\infty$ norm of $\epsilon = 8/255$. We report oracle accuracies for the PC and optimized bases, as well as the corresponding CIFAR-10 test set performance, in table 9, and plot accuracies for all choices of number of components in fig. 12. We additionally include in table 9 results for the augmentation-driven adversarial defense of Gowal et al. (2021), a representative state-of-the-art method for this dataset and architecture. We find that both the PC basis and the optimized basis recover a meaningful amount of the network's performance on the adversarial set.

L PSEUDOCODE FOR COMPUTING ORACLE AND LAYER-ORACLE ACCURACIES

In this section we present pseudocode for the procedure used to report oracle and layer-oracle accuracies given the OoD and IID accuracies of a network for all choices of layer and number of components.

Algorithm 1 Pseudocode for computing oracle and layer-oracle accuracies for one network given OoD and IID accuracies for all considered numbers of components and layers
1: Input: Layers $L = \{0, \dots, l-1\}$, OoD accuracy matrix $A \in \mathbb{R}^{l \times m}$, IID accuracy matrix $B \in \mathbb{R}^{l \times m}$
2: Set oracle layer $l_o \leftarrow \arg\max_{i \in L} \max_{j \in \{0, \dots, m-1\}} A_{ij}$
3: Set heuristic index $k \leftarrow \min_{j \in \{0, \dots, m-1\}} j$ such that $B[l_o, j] \geq 0.999 \times B[l_o, m-1]$
4: Oracle accuracy $o \leftarrow \max_{j \in \{0, \dots, m-1\}} A[l_o, j]$
5: Layer-oracle accuracy $lo \leftarrow A[l_o, k]$

This is a special case of Huh et al. (2021) where the third eigenvalue is 0, instead of being very small.

Figure 1: Illustration of nullspace failure. Left: Training data with variation in $v_1$, $v_2$ but no variation in $v_3 = 1$. Black line: Decision boundary learnt by a 3 hidden layer MLP function $f$ with inputs $(v_1, v_2, v_3)$ visualized with $v_3 = 1$. Right: Decision boundary of the same classifier evaluated on the plane $v_3 = 10$. $f$ is sensitive to $v_3$, causing nullspace error (red box) on test data.

Figure 3: Nullspace occupancy relates to OoD failure: Projecting test OoD activations onto a subspace of rank $m$ (x-axis) improves OoD performance (y-axis). For each dataset, namely Rotated-MNIST (left), PACS (middle), and TerraIncognita (right), we pick one illustrative network out of 12 networks and the layer which best explains the nullspace occupancy related failure mode.

Figure 4: (left) Oracle improvements (o) vs L2 penalty: For Rotated-MNIST, for each bin of L2 penalty values and test domain, we train 15 networks and select the one that achieves highest validation accuracy. We plot the OoD accuracy of selected networks averaged across test domains along with the average oracle improvement on the selected networks. We observe that our method gives comparable performance for all L2 penalty values. (right) Nullspace occupancy for poisoned test data: Classification accuracy after projecting network activations to the top-$m$ components in the PC basis and random basis. The black and red curves are the amount of total variance explained by the first $m$ components. We observe that for the PC basis, test set accuracy increases until the training variance explained is saturated and sharply declines when including nullspace components.

Figure 5: Nullspace occupancy relates to OOD failure: Projecting test OOD activations onto a subspace of rank $m$ (x-axis) improves OOD performance (y-axis). We randomly pick one of 18 networks trained on DomainNet and the layer which best explains the nullspace occupancy related failure mode.

Figure 6: Oracle improvements (o) vs weight decay: For PACS, for each bin of weight decay values and test domain, we train 15 networks and select the one that achieves highest validation accuracy. We plot the OOD accuracy of selected networks averaged across test domains along with the average oracle improvement on the selected networks. We observe that our method gives comparable performance for all weight decay values.

Figure 8: Improvements in out-of-distribution classification relative to full network on Rotated-MNIST. The x-axis is the rank of the subspace used for projection, while the y-axis is the improvement in OOD classification accuracy relative to the base network. Each row of plots corresponds to a different out-of-distribution test domain.

Figure 9: Improvements in out-of-distribution classification relative to full network on PACS. The x-axis is the rank of the subspace used for projection, while the y-axis is the improvement in OOD classification accuracy relative to the base network. Each row of plots corresponds to a different out-of-distribution test domain.

Figure 12: Nullspace occupancy for adversarial examples. Classification accuracy on CIFAR-10 test set and adversarial example set after projecting to the top-$m$ components of the PC and optimized bases in layer 0, the oracle layer.

M PSEUDOCODE FOR OPTIMIZING UNITARY MATRICES TO OBTAIN THE OPTIMIZED BASIS

In this section we present pseudocode for optimizing the objective in equation 3 using the unitary optimization procedure of Kiani et al. (2022).

Algorithm 2 Pseudocode for obtaining the optimized basis
1: Input: Encoder $g$ to target layer, classifier $f$ from target layer, training mini-batches $\{(x_i, y_i)\}_{i=1}^{N}$, training loss $L_{train}$, step size $\alpha$, PC basis $V_{pc}$, $\mu_{pc}$
2: Initialize basis $V \leftarrow V_{pc}$, $\mu \leftarrow \mu_{pc}$
3: for $i \in \{1, \dots, N\}$ do
4:   …
5:   … target layer $z \leftarrow g(x_i)$
6:   for number of components $k \in \{1, \dots, m\}$ do
7:     Project representation using eq. 1 to get $\hat{z}$, using $z$, $V$, and $k$ components
8:     $\hat{y} \leftarrow f(\hat{z})$
9:     $l \leftarrow l + L_{train}(\hat{y}, y)$
10:   end for
11:   …
12:   Project gradient $p \leftarrow 0.5 \times (g - V g^T V)$
13:   Update basis $V \leftarrow V \exp(-\alpha V^T p)$
14: end for
15: Return $V$, $\mu$
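The basis update in steps 12 and 13 of Algorithm 2 keeps $V$ orthogonal, because $V^T p$ is skew-symmetric whenever $V^T V = I$ and the exponential of a skew-symmetric matrix is orthogonal. The sketch below illustrates this on a toy problem. It assumes that $g$ in steps 12 and 13 denotes the gradient of the accumulated loss with respect to $V$, uses a hypothetical quadratic loss in place of the training loss, and computes the matrix exponential with a truncated Taylor series, which is adequate only for small step sizes.

```python
import numpy as np

def matrix_exp(A, terms=30):
    """Truncated Taylor series for exp(A); adequate when ||A|| is small."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

def basis_update(V, grad, alpha):
    """One update of an orthogonal basis V, following steps 12-13."""
    p = 0.5 * (grad - V @ grad.T @ V)          # project the gradient
    return V @ matrix_exp(-alpha * (V.T @ p))  # V^T p is skew-symmetric

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal start
target = rng.normal(size=(4, 4))

def toy_loss(basis):
    """Hypothetical stand-in for the training loss."""
    return 0.5 * np.sum((basis - target) ** 2)

grad = V - target  # gradient of toy_loss with respect to V
V_new = basis_update(V, grad, alpha=0.05)
```

On this toy problem the updated matrix remains orthogonal to numerical precision and the loss decreases, so the step behaves like gradient descent constrained to the set of orthogonal matrices.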

Aggregate domain generalization performance achieved by different bases on RotatedMNIST, PACS, TerraIncognita, and DomainNet. For each basis, we report the layer-oracle accuracy (lo) and the overall oracle performance (o). Error bars are in Appendix A along with per-domain breakdowns.

Aggregate domain generalization performance achieved by different bases on RotatedMNIST, PACS, and TerraIncognita using leave-one-out hyperparameter selection. We report mean results and standard errors over three independent trials.

Per-domain oracle (o)  and layer-oracle (lo) out-of-distribution classification accuracies for different choices of bases on PACS.

Per-domain oracle (o) and layer-oracle (lo) out-of-distribution classification accuracies for different choices of bases on TerraIncognita.

Per-domain oracle (o)  and layer-oracle (lo) out-of-distribution classification accuracies for different choices of bases on DomainNet.

Aggregate in-distribution generalization performance achieved by different bases on RotatedMNIST, PACS, TerraIncognita, and DomainNet. For each basis, we report the layer-oracle accuracy (lo) and the overall oracle performance (o).

Oracle performance after projection on the adversarial set, along with corresponding clean test set accuracies on CIFAR-10.


We include the following appendices:
• Appendix A: We report per-domain domain generalization results for projection in all bases on Rotated-MNIST, PACS, TerraIncognita, and DomainNet.
• Appendix B: We provide figures from the main text on additional datasets.
• Appendix C: We demonstrate that our reported oracle results are not an artifact of considering the maximum of many models. To establish this, we compare achieved improvements against the maximum performance achieved by a large set of models with the same accuracy as the base ERM, equipped with randomness.
• Appendix D: We discuss the connection between nullspace components and the theoretical framework of Gradient Starvation (Pezeshki et al., 2020).
• Appendix E: We include a formal discussion of weight decay and how it relates to the nullspace phenomenon.
• Appendix F: We report results for projection on in-distribution generalization for Rotated-MNIST, PACS, TerraIncognita, and DomainNet.
• Appendix G: We report additional details for the data poisoning experiment.
• Appendix H: We interpolate between test datapoints $\hat{z}$ (with components in the nullspace) and the projected test datapoints $\hat{z}_m$, and study how accuracy varies as we move between them.
• Appendix I: We present the small convolutional architecture used for all experiments on Rotated-MNIST.
• Appendix J: We include plots of out-of-distribution accuracies after projection into all the different bases, for all networks we consider across every dataset.
• Appendix K: We study the extent to which the optimized basis can help defend against adversarial attacks.
• Appendix L: We include pseudocode for computing the oracle and layer-oracle accuracies reported in the paper.
• Appendix M: We include pseudocode for obtaining our optimized basis by optimizing equation 3.

A PER-DOMAIN DOMAIN GENERALIZATION RESULTS

In this section, we show per-domain results of Table 2 from the main text. For each test domain, we report the mean accuracy and SEM over three nets selected using DomainBed search protocols, after projection into different low rank feature spaces. 

K ROBUSTNESS TO ADVERSARIAL PERTURBATIONS

In the main paper, we showed how alleviating nullspace failure leads to improved OOD generalization. Can we also improve performance on adversarial examples generated to fool the full network by projecting to a limited number of components? To check this, we perform an experiment where we

