CHOPPING FORMERS IS WHAT YOU NEED IN VISION

Abstract

This work presents a new dynamic fully-connected layer (DFC) that generalizes existing layers and is free from hard inductive biases, and describes how to factorize the DFC weights efficiently. Using Einstein notation as our framework, we define the DFC as a fully connected layer whose weight tensor is generated as a function of the input. The DFC is the non-linear extension of the most general linear neural-network layer, and therefore all major neural network layers, from convolution to self-attention, are particular cases of DFCs. A stack of DFCs interleaved with non-linearities defines a new super-class of neural networks: Formers. The DFC has four major characteristics: i) it is dynamic, ii) it is spatially adaptive, iii) it has a global receptive field, and iv) it mixes information across all available channels. In their complete form, DFCs are powerful layers free from hard inductive biases, but their practical use is limited by their prohibitive computational cost. To overcome this limitation and deploy DFCs in real computer vision applications, we propose to use the CP decomposition, showing that the DFC layer can be factorized into smaller, manageable blocks without losing any representational power. Finally, we propose ChoP'D Former, an architecture built on a new decomposition of the DFC layer into five sequential operations, each incorporating one characteristic of the original DFC tensor. ChoP'D Former leverages dynamic gating and integral images, achieves global spatial reasoning with constant time complexity, and has a receptive field that adapts to the task. Extensive experiments demonstrate that ChoP'D Former is competitive with state-of-the-art results on three well-known computer vision benchmarks, namely large-scale classification, object detection, and instance segmentation, eliminating the need for expensive architecture search and hyperparameter optimization.
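To make the definition concrete, the following is a minimal numpy sketch of a DFC layer as the abstract describes it: a fully connected map whose weight tensor is produced as a function of the input, applied via an Einstein-notation contraction over all spatial positions and channels. The function names `dfc` and `toy_w_gen` are illustrative placeholders, not the paper's implementation; the toy weight generator simply demonstrates input dependence.

```python
import numpy as np

def dfc(x, w_gen):
    """Dynamic fully-connected (DFC) layer, sketched.

    x:     input of shape (N, C) -- N spatial positions, C channels.
    w_gen: function mapping x to a 4-D weight tensor of shape
           (N, C, N, C), i.e. the weights depend on the input itself.
    """
    W = w_gen(x)  # dynamic: weights are generated from the input
    # Contraction in Einstein notation: each output position/channel
    # (n, c) mixes every input position/channel (m, d) -- global
    # receptive field and full channel mixing.
    return np.einsum('ncmd,md->nc', W, x)

def toy_w_gen(x):
    """Illustrative (hypothetical) weight generator: rank-1
    outer-product weights built from the flattened input."""
    N, C = x.shape
    v = x.reshape(-1)
    return np.outer(v, v).reshape(N, C, N, C) / v.size

x = np.random.randn(4, 3)
y = dfc(x, toy_w_gen)
assert y.shape == x.shape
```

Note that a static fully connected layer is recovered when `w_gen` ignores its input and returns a constant tensor, and a convolution corresponds to a `w_gen` whose output is sparse (local) and shift-shared; this is the sense in which the DFC generalizes existing layers.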

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have served as the undisputed cornerstone of Computer Vision (CV) for the past decade thanks to convolutions, which, despite the hard inductive biases of local connectivity and shared weights, summarize spatial content very efficiently (Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; He et al., 2016; Howard et al., 2017; Tan & Le, 2019). Nevertheless, in the 2020s, with the availability of more abundant computing resources, the role of convolutions has been challenged by the advent of Transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020) and a new "spatial-mixing" module, called Self-Attention, characterized by lighter inductive biases and higher complexity. The success of Vision Transformers (ViT) has long been attributed to Self-Attention. However, new findings have recently questioned this narrative. For example, d'Ascoli et al. (2021); Wu et al. (2021); Liu et al. (2021b) highlight the importance of convolutional biases in Transformers for CV. Liu et al. (2022); Yu et al. (2022) demonstrate how macro design choices and training procedures alone can be sufficient to achieve competitive performance regardless of the specific spatial module used. Finally, Cordonnier et al. (2019); Han et al. (2021) comment on the close link between convolution and Self-Attention formulations, hence blurring the line between these seemingly orthogonal operators. Here, we take a new step toward bridging the gap between CNNs and Transformers by providing a unifying and intuitive formulation that clarifies the role of spatial modules in modern architectures and links existing work together.

