NULLSPACE OF VISION TRANSFORMERS AND WHAT IT TELLS US

Abstract

The nullspace of a linear mapping is the subspace that is mapped to the zero vector. For a linear map, adding an element of the nullspace to its input has no effect on the output of the mapping. We position this work as an exposition towards answering one simple question: "Does a vision transformer have a non-trivial nullspace?" If the answer is yes, this would imply that adding elements from this non-trivial nullspace to an input has no effect on the output of the network. This finding can eventually lead us closer to understanding the generalization properties of vision transformers. In this paper, we first prove that a non-trivial nullspace exists for a particular class of vision transformers. The proof follows directly from computing the nullspace of the patch embedding matrices. We extend this idea to the non-linear layers of the vision transformer and show that it is possible to learn a non-linear counterpart to the nullspace via simple optimisation for any vision transformer. Subsequently, we perform studies to understand the robustness properties of ViTs under nullspace noise. Under robustness, we investigate prediction stability and the fooling properties (network and interpretation) of the noise. Lastly, we present image watermarking as an application of nullspace noise.
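The linear-algebra fact underlying this claim can be sketched in a few lines of NumPy. The matrix below is a random stand-in for a patch embedding (the 768-to-192 dimensions mimic a small ViT but are purely illustrative, not trained weights); any wide matrix of this shape has a non-trivial nullspace, and adding a nullspace vector to the input leaves the projection unchanged:

```python
import numpy as np

# Random stand-in for a patch embedding: maps a flattened 16x16x3 patch
# (768 values) to a 192-dim token. Dimensions mimic a small ViT but are
# illustrative only; this is NOT a trained ViT weight matrix.
rng = np.random.default_rng(0)
E = rng.standard_normal((192, 768))

# Nullspace basis from the SVD: the right singular vectors beyond the rank
# of E (192 almost surely, for a random matrix) span the nullspace.
_, _, Vt = np.linalg.svd(E)
null_basis = Vt[192:]            # 576 orthonormal nullspace directions

x = rng.standard_normal(768)     # a flattened input patch
noise = 10.0 * null_basis[0]     # a large perturbation along the nullspace
print(np.allclose(E @ x, E @ (x + noise)))  # True: the projection is unchanged
```

Replacing the random matrix with a ViT's learned patch embedding weights gives per-patch noise that leaves every input token, and hence the network output, unchanged, which is the construction the paper builds on.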

1. INTRODUCTION

In recent years, deep learning models have seen tremendous success, so much so that these models are now the standard for tasks ranging from image recognition (He et al., 2016; Dosovitskiy et al., 2021; Wortsman et al., 2022), object detection (Dai et al., 2021a; Guan et al., 2022), scene segmentation (Fang et al., 2021; Zhang et al., 2022b), language translation (Devlin et al., 2019; Xue et al., 2022), and speech recognition (Nagrani et al., 2021; Chen et al.), among many others. We are now observing their application to various novel problems such as predicting protein structures (Jumper et al., 2021), generative modelling (Dhariwal & Nichol, 2021), autonomous driving (Grigorescu et al., 2020), and solving differential equations (Lample & Charton, 2020), to name a few. It is only fair to assume that we shall witness an accelerated adoption of deep learning models into our daily lives as the years go by. In computer vision, most architectures can be broadly classified into two groups based on their building blocks: convolutional neural networks (CNNs) and transformers. CNNs have enjoyed overwhelming popularity since their state-of-the-art-redefining performance (Ciregan et al., 2012; Krizhevsky et al., 2012) on the ImageNet challenge (Russakovsky et al., 2015). The main characteristic of CNNs is the use of trainable kernels to perform convolution operations in a strided fashion on the inputs (LeCun et al., 1989; O'Shea & Nash, 2015). In contrast, transformer-based architectures like Vision Transformers (ViTs) (Dosovitskiy et al., 2021) are more recent inventions. The key aspect of transformer-based architectures is the adaptation and utilisation of self/cross-attention modules (Vaswani et al., 2017). In the short span of three years, transformer-based vision models have gained tremendous popularity.
They compete with CNNs on various computer vision tasks (Touvron et al., 2021; Carion et al., 2020; Arnab et al., 2021). The focus of our work is a specific architecture, the Vision Transformer. It processes an input image by first splitting it into several non-overlapping patches, followed by a linear projection, which is then processed by a vanilla transformer. Since their introduction in 2020, ViTs have been the source of inspiration for several recent novel architectures (Ali et al., 2021; Li et al., 2022b; Liu et al., 2021). Also, researchers have made a multitude of recent discoveries about the properties of transformer-based architectures with ViTs as the focal point of their studies (Naseer et al., 2021; Mahmood et al., 2021; Minderer et al., 2021). This decision to prioritise ViTs stems from the simplicity of ViTs, which results in fewer engineering hurdles, and from their flexibility and versatility, which leads to wider adoption across domains. Although the architecture is still relatively new, the community has made several insightful findings while investigating its working mechanisms. For example, Naseer et al. (2021) showed that ViTs are robust to occlusions, input perturbations and domain shifts, and that they rely less on local textures than CNNs, as shown earlier by Geirhos et al. (2018). These findings were also corroborated in a recent work (Zhang et al., 2022a). Further, Zhou et al.
(2022) verified that the robustness of ViTs to several corruptions is primarily due to the self-attention blocks. However, a recent finding by Pinto et al. (2021) attributes the relatively better performance of ViTs on out-of-distribution generalisation to a flawed comparison of models based solely on the number of parameters. On a related topic, while studying adversarial robustness, Mahmood et al. (2021) found that, unlike for CNNs, adversarial examples for ViTs had lower transferability across architectures, which inspired them to build a more robust ensemble. On the other hand, as per Bai et al. (2021), the improved generalisation of ViTs benefits from the self-attention framework employed by the network. They also reported that ViTs are as vulnerable to adversarial attacks (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016) as their CNN counterparts under a fair evaluation. Moreover, they found no evidence suggesting that larger datasets help ViTs more than CNNs in improving generalisation. These two findings conflict with earlier works inspecting ViTs (Bhojanapalli et al., 2021; Shao et al., 2021). On the aspect of neural network calibration (Guo et al., 2017), Minderer et al. (2021) observed that recent architectures, such as ViTs, are much better calibrated in their prediction scores. Finally, the architectural differences between CNNs and ViTs naturally result in differences in their inductive biases (Raghu et al., 2021). Though no single architecture type comes off as the clear winner, it is fair to believe that ViTs are in general the primary focus of the community when studying transformer architectures in vision. Sharing the motivation of previous works to explore and further our understanding of ViTs, in this paper we aim to highlight an untouched aspect: the nullspace of vision transformers. The nullspace of a linear map f : X → Y is the subspace of X which is mapped to 0. Formally, ∆ = {x | f(x) = 0}. Why is the nullspace important?
The seemingly innocuous concept of the nullspace describes a very interesting property of ViTs. If such a subspace exists for ViTs, then it would imply that the network is inherently robust to certain types of input perturbations (or corruptions). That is, simply sampling



Figure 1: An illustration of the nullspace in three cases (projection function, top left; linear function, bottom left; vision transformer, right). For the functions in these three cases, there exists some nullspace, and the output of the function with respect to the input will remain the same no matter how much perturbation is introduced to the input along the nullspace. Also, the nullspace is function-specific (model-specific) and will not vary for different samples.
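The linear case in the figure can be made concrete with a two-dimensional toy example (the coefficients below are made up for illustration): for f(x) = a1*x1 + a2*x2, the direction (a2, -a1) spans the nullspace, so a perturbation of any magnitude along it leaves the output unchanged.

```python
import numpy as np

# Toy linear map f(x) = a1*x1 + a2*x2, written as a 1x2 matrix A.
# The direction n = (a2, -a1) satisfies A @ n = a1*a2 - a2*a1 = 0,
# so it spans the (one-dimensional) nullspace of A.
a1, a2 = 3.0, 4.0
A = np.array([[a1, a2]])
n = np.array([a2, -a1])

x = np.array([1.0, 2.0])              # arbitrary input
for alpha in (0.5, -7.0, 1e6):        # the perturbation magnitude is irrelevant
    assert np.allclose(A @ x, A @ (x + alpha * n))
```

The same invariance holds for any x, which is what makes the nullspace sample-independent, as the caption notes.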

In comparison to CNNs, ViTs lack inductive biases for local structure (edges, corners). To remedy this, Dai et al. (2021b); Xu et al. (2021); Li et al. (2022a) proposed solutions that resort to either large amounts of training data or the inclusion of convolution layers.

