THE NULLSPACE OF VISION TRANSFORMERS: WHAT DOES IT TELL US?

Abstract

The nullspace of a linear map is the subspace that is mapped to the zero vector; adding an element of the nullspace to the input of a linear map leaves the output unchanged. We position this work as an exposition towards answering one simple question: "Does a vision transformer have a non-trivial nullspace?" If it does, then adding elements of this non-trivial nullspace to an input has no effect on the output of the network, a finding that can eventually bring us closer to understanding the generalization properties of vision transformers. In this paper, we first prove that a non-trivial nullspace exists for a particular class of vision transformers; the proof follows directly from computing the nullspace of the patch embedding matrices. We then extend this idea to the non-linear layers of the vision transformer and show that a non-linear counterpart of the nullspace can be learned via simple optimisation for any vision transformer. Subsequently, we study the robustness properties of ViTs under nullspace noise, investigating prediction stability and the fooling properties of the noise with respect to both the network and its interpretations. Lastly, we present image watermarking as an application of nullspace noise.
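The linear-algebraic argument above can be illustrated numerically. The sketch below computes a nullspace basis of a patch embedding matrix via SVD and verifies that adding nullspace "noise" to a flattened patch leaves its embedding unchanged. The dimensions (a 384×768 embedding matrix, i.e. 16×16×3 patches mapped to 384-dimensional tokens) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical patch-embedding matrix: maps flattened 16x16x3 patches
# (768 dims) to 384-dim tokens. Since 384 < 768, a non-trivial
# nullspace of dimension >= 384 must exist by rank-nullity.
E = rng.standard_normal((384, 768))

# Right singular vectors beyond the rank of E span its nullspace.
# For a random Gaussian matrix the rank is min(384, 768) = 384.
U, s, Vt = np.linalg.svd(E)
null_basis = Vt[len(s):]          # shape (384, 768): a nullspace basis

x = rng.standard_normal(768)      # a flattened input patch
n = null_basis.T @ rng.standard_normal(384)  # arbitrary nullspace element

# Adding nullspace noise does not change the patch embedding
# (up to floating-point error).
assert np.allclose(E @ x, E @ (x + n))
```

This only establishes invariance of the linear patch embedding; the paper's learned non-linear counterpart for the full network is a separate optimisation problem.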

1. INTRODUCTION

In recent years, deep learning models have seen tremendous success, so much so that they are now the standard for tasks ranging from image recognition (He et al., 2016; Dosovitskiy et al., 2021; Wortsman et al., 2022), object detection (Dai et al., 2021a; Guan et al., 2022), scene segmentation (Fang et al., 2021; Zhang et al., 2022b), language translation (Devlin et al., 2019; Xue et al., 2022), and speech recognition (Nagrani et al., 2021; Chen et al.), among many others. We are now observing their application to various novel problems such as predicting protein structures (Jumper et al., 2021), generative modelling (Dhariwal & Nichol, 2021), autonomous driving (Grigorescu et al., 2020), and solving differential equations (Lample & Charton, 2020), to name a few. It is only fair to assume that we shall witness an accelerated adoption of deep learning models into our daily lives in the years to come. In computer vision, most architectures can be broadly classified into two groups based on their building blocks: convolutional neural networks (CNNs) and transformers. CNNs have gained overwhelming popularity since their state-of-the-art-redefining performance (Ciregan et al., 2012; Krizhevsky et al., 2012) on the ImageNet challenge (Russakovsky et al., 2015). The main characteristic of CNNs is the use of trainable kernels to perform convolutions in a strided fashion over the input (LeCun et al., 1989; O'Shea & Nash, 2015). Transformer-based architectures such as Vision Transformers (ViTs) (Dosovitskiy et al., 2021), in contrast, are more recent inventions. A key aspect of transformer-based architectures is the adaptation and utilisation of self/cross-attention modules (Vaswani et al., 2017). In the short span of three years, transformer-based vision models have gained tremendous popularity.
They compete with CNNs on various computer vision tasks (Touvron et al., 2021; Carion et al., 2020; Arnab et al., 2021). The focus of our work is one specific architecture, the Vision Transformer, which processes an input image by first splitting it into several non-overlapping patches, followed by a linear projection, the result of which is then processed by a vanilla transformer. Since their introduction in 2020, ViTs have been the source of inspiration for several recent novel architectures (Ali et al., 2021; Li et al., 2022b; Liu et al., 2021). Also, researchers have made a multitude of recent
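The patchify-and-project step of the Vision Transformer described above can be sketched in a few lines of numpy. The image size (224×224×3), patch size (16), and embedding dimension (384) below are illustrative assumptions; `patchify_and_embed` is a hypothetical helper, not part of any ViT library.

```python
import numpy as np

def patchify_and_embed(image, E, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch
    blocks and project each flattened patch with embedding matrix E."""
    H, W, C = image.shape
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)       # group by (row-block, col-block)
                    .reshape(-1, patch * patch * C))  # (num_patches, patch*patch*C)
    return patches @ E.T                              # (num_patches, embed_dim)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
E = rng.standard_normal((384, 16 * 16 * 3))   # linear patch embedding
tokens = patchify_and_embed(img, E)
print(tokens.shape)  # (196, 384): 14x14 patches, each a 384-dim token
```

The resulting token sequence (plus position embeddings and a class token, omitted here) is what the vanilla transformer encoder consumes.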

