CHOPPING FORMERS IS WHAT YOU NEED IN VISION

Abstract

This work presents a new dynamic, fully-connected layer (DFC) that generalizes existing layers and is free from hard inductive biases, and then describes how to factorize the DFC weights efficiently. Using the Einstein summation convention as a framework, we define the DFC as a fully connected layer whose weight tensor is created as a function of the input. The DFC is the non-linear extension of the most general linear neural network layer; therefore, all major neural network layers, from convolution to self-attention, are particular cases of DFCs. A stack of DFCs interleaved with non-linearities defines a new super-class of neural networks: Formers. The DFC has four major characteristics: i) it is dynamic, ii) it is spatially adaptive, iii) it has a global receptive field, and iv) it mixes information across all available channels. In their complete form, DFCs are powerful layers free from hard inductive biases, but their use in practice is limited by their prohibitive computational cost. To overcome this limitation and deploy DFCs in real computer vision applications, we propose to use CP Decomposition, showing that it is possible to factorize the DFC layer into smaller, manageable blocks without losing any representational power. Finally, we propose ChoP'D Former, an architecture built on a new decomposition of the DFC layer into five sequential operations, each incorporating one characteristic of the original DFC tensor. ChoP'D Former leverages dynamic gating and integral images, achieves global spatial reasoning with constant time complexity, and has a receptive field that can adapt to the task. Extensive experiments demonstrate that ChoP'D Former is competitive with state-of-the-art results on three well-known CV benchmarks, namely Large-Scale Classification, Object Detection, and Instance Segmentation, removing the need for expensive architecture search and hyperparameter optimization.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have served as the undisputed cornerstone of Computer Vision (CV) for the past decade thanks to convolutions, which, despite the hard inductive biases of local connectivity and shared weights, are able to summarize spatial content very efficiently (Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; He et al., 2016; Howard et al., 2017; Tan & Le, 2019). Nevertheless, in the 2020s, with the availability of more abundant computing resources, the role of convolutions has been challenged by the advent of Transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020) and a new "spatial-mixing" module, called Self-Attention, characterized by lighter inductive biases and higher complexity. The success of Vision Transformers (ViT) has long been attributed to Self-Attention. However, new findings have recently questioned this narrative: Cordonnier et al. (2019) and Han et al. (2021) comment on the close link between convolution and Self-Attention formulations, hence blurring the line between these seemingly orthogonal operators. Here, we take a new step toward bridging the gap between CNNs and Transformers by providing a unifying and intuitive formulation that clarifies the role of spatial modules in modern architectures and links existing work together. First, we use Einstein's tensor notation combined with tensor CP Decomposition to provide a practical yet principled analysis of the existing literature. In essence, the principal ingredients of deep learning architectures are multi-dimensional operations that can naturally be written as decomposed tensor expressions. Here, the Einstein notation provides an elegant way to analyze neural network operators by highlighting differences among layers with an intuitive notation that simplifies multi-dimensional matrix algebra (Kolda & Bader, 2009; Panagakis et al., 2021; Hayashi et al., 2019) with no compromises in formal accuracy.
Under this lens, we formalize a generalization of existing layers with a new dynamic, spatially adaptive, and fully connected building block for neural networks (the DFC) that represents the general, but computationally complex, operation of extracting the complete set of interactions within the input. Second, we use DFCs to define a super-class of neural networks, which we call Formers, where the dense and heavy DFC operators are used to create hierarchical representations of the input images. Then, to target real-world applications, usually bounded by tight computational budgets, we explore the use of CP Decomposition to decrease Formers' complexity and to integrate different inductive biases into their design. In this light, we show that Transformer architectures can be seen as one possible instance of Formers, and we go a step further by proposing a new ChoP'D Former variant. ChoP'D Former leverages CP Decomposition, dynamic gating, and integral images to "chop" the general but prohibitively complex DFC into a sequence of efficient transformations that have the potential to retain its full representational power. In particular, we identify five specific modules that model the dynamicity with respect to the input, the adaptivity with respect to spatial positions, and the long-range interactions via a dynamic receptive field, with an overall complexity independent of the number of input tokens. Finally, this new perspective allows us to justify the empirical success of (Trans)Formers and to disentangle the contributions of each of their characteristics. To do so, we programmatically compare different layers and CP-Decomposed architectures on various small-scale and large-scale CV tasks. Our experiments indicate that CP-Decomposed DFC layers can effectively approximate the full DFC at a significantly reduced cost, considerably outperforming its simplified variants.
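To make the object being discussed concrete, the following is a minimal numpy sketch of a DFC layer, not the paper's exact implementation: a hypothetical linear weight-generating map `gen` produces the full dynamic weight tensor W(X) from the input, which is then contracted against the input with a single einsum over all positions and channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: n spatial positions, c channels. Kept tiny because the
# full DFC weight tensor has (n*c)^2 entries -- the prohibitive cost
# that motivates the CP Decomposition.
n, c = 4, 3

# Hypothetical weight-generating function: a random linear map from
# the flattened input to the full dynamic weight tensor W(X).
gen = rng.standard_normal((n * c * n * c, n * c)) / np.sqrt(n * c)

def dfc(x):
    """Dynamic fully-connected layer: the weights depend on the input."""
    w = (gen @ x.reshape(-1)).reshape(n, c, n, c)  # W(X), dynamic
    # Fully-connected contraction over all positions p and channels d:
    # y[q, e] = sum_{p, d} W[q, e, p, d] * x[p, d]
    return np.einsum('qepd,pd->qe', w, x)

x = rng.standard_normal((n, c))
y = dfc(x)
assert y.shape == (n, c)
```

Because W(X) is itself (here) linear in X, the layer is overall quadratic in its input, illustrating that a DFC is a non-linear operator even before any point-wise activation is applied.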
In conclusion, our contributions can be summarized as follows:
• We provide a unifying view on building blocks for neural networks that generalizes and compares existing methods via Einstein's notation and CP Decomposition, with a notation that deals with multi-dimensional tensor expressions without resorting to heavy tensor algebra.
• We show how to use a complete tensor operator that is spatially adaptive, fully connected, and dynamic (the DFC) to create general neural networks, which we dub "Formers".
• We connect our formulation to existing architectures by showing how the Transformer and its variants can be seen as stacks of CP-Decomposed DFC operators.
• We propose ChoP'D Former, a new variant of the Former architecture, which approximates the full DFC with a complexity comparable to a convolution with a small kernel, and matches, if not improves, SoTA performance on several benchmarks, including large-scale classification, object detection, and instance segmentation.
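To give intuition for the factorization that underlies these contributions, here is a generic CP Decomposition sketch on a small 3-way tensor (a textbook illustration with made-up sizes, not the ChoP'D factorization itself): the tensor is written as a sum of R rank-one components, and contractions can work on the factors directly without ever materializing the full tensor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a rank-2 tensor T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
I, J, K, R = 5, 4, 3, 2
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# Storage: the factors need (I + J + K) * R numbers instead of I*J*K.
full_params = I * J * K        # 60
cp_params = (I + J + K) * R    # 24

# Contracting an input against T can use the factors directly:
x = rng.standard_normal((J, K))
y_full = np.einsum('ijk,jk->i', T, x)
y_cp = A @ np.einsum('jr,kr,jk->r', B, C, x)
assert np.allclose(y_full, y_cp)
```

The same principle, applied to the much larger DFC weight tensor, is what turns one dense contraction into a sequence of small, sequential operations.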

2. EINSTEIN NOTATION FOR NEURAL NETWORKS

At their core, neural networks, and deep learning architectures in particular, are commonly built as a sequence of tensor operations (i.e., building blocks) interleaved with point-wise non-linearities. Tremendous interest has been devoted to the form of such building blocks (e.g., "MLP", "Convolution", "Residual Block", "Dense Block", "Deformable Conv", "Attention", "Dynamic-Conv", etc.), as these are the critical components for extracting various kinds of meaningful information from the input. In this section, we present a general form of a neural network layer and showcase how the Einstein summation convention can be used as an alternative, shorthand, and self-contained way to represent and relate building blocks for neural networks.
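As a preview of how the convention reads in practice, several familiar building blocks reduce to single `einsum` expressions; the numpy sketch below (with hypothetical toy sizes) writes a linear layer and unnormalized self-attention scores this way.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c_in, c_out = 6, 4, 5          # tokens, input/output channels
X = rng.standard_normal((n, c_in))

# Linear (MLP) layer: Y_ij = X_ik W_kj, summing the repeated index k.
W = rng.standard_normal((c_in, c_out))
Y_mlp = np.einsum('ik,kj->ij', X, W)
assert np.allclose(Y_mlp, X @ W)

# Self-attention scores: S_ij = Q_ik K_jk, a dot product over channels.
Wq = rng.standard_normal((c_in, c_in))
Wk = rng.standard_normal((c_in, c_in))
Q = np.einsum('ik,kl->il', X, Wq)
K = np.einsum('ik,kl->il', X, Wk)
S = np.einsum('ik,jk->ij', Q, K)
assert S.shape == (n, n)
```

The subscript strings make explicit which modes are mixed (repeated indices are contracted) and which are carried through, which is exactly the property exploited when comparing layers in the remainder of the paper.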



d'Ascoli et al. (2021); Wu et al. (2021); Liu et al. (2021b) highlight the importance of convolutional biases in Transformers for CV. Liu et al. (2022); Yu et al. (2022) demonstrate how macro design choices and training procedures alone can be sufficient to achieve competitive performance regardless of the specific spatial module used.

Einstein notation. In the rest of the paper, we adopt the notation of Laue et al. (2020). Tensors are denoted with uppercase letters, and indices into the dimensions of a tensor are denoted with lowercase subscripts. For instance, X_ijk ∈ R^(I×J×K) is a three-dimensional tensor of size I × J × K with three modes (or dimensions) indexed by i ∈ [1, I], j ∈ [1, J], and k ∈ [1, K]. Using the Einstein notation,

