IS ATTENTION BETTER THAN MATRIX DECOMPOSITION?

Abstract

As an essential ingredient of modern deep learning, the attention mechanism, especially self-attention, plays a vital role in discovering global correlations. However, is hand-crafted attention irreplaceable for modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) models developed 20 years ago in terms of performance and computational cost for encoding long-distance dependencies. We model the global context issue as a low-rank completion problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. Comprehensive experiments are conducted on vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants. Code is available.

1. INTRODUCTION

Since self-attention and the transformer (Vaswani et al., 2017) showed significant advantages over recurrent neural networks and convolutional neural networks in capturing long-distance dependencies, attention has been widely adopted in computer vision (Wang et al., 2018; Zhang et al., 2019a) and natural language processing (Devlin et al., 2019) for global information mining. However, is hand-crafted attention irreplaceable for modeling the global context? This paper focuses on a new approach to designing global context modules. The key idea is that, if we formulate an inductive bias, such as the global context, as an objective function, the optimization algorithm that minimizes this objective can construct a computational graph, i.e., the architecture we need in the network. We particularize this idea by developing a counterpart for the most representative global context module, self-attention. Viewing the extraction of global information in networks as finding a dictionary and the corresponding codes that capture the inherent correlation, we model context discovery as low-rank completion of the input tensor and solve it via matrix decomposition. This paper then proposes a global correlation block, Hamburger, which employs matrix decomposition to factorize the learned representation into sub-matrices so as to recover the clean low-rank signal subspace. The iterative optimization algorithm that solves the matrix decomposition defines the central computational graph, i.e., Hamburger's architecture. Our work builds Hamburger on classic matrix decomposition models, including Vector Quantization (VQ) (Gray & Neuhoff, 1998), Concept Decomposition (CD) (Dhillon & Modha, 2001), and Non-negative Matrix Factorization (NMF) (Lee & Seung, 1999).
Additionally, instead of directly applying the Back-Propagation Through Time (BPTT) algorithm (Werbos et al., 1990) to differentiate the iterative optimization, we adopt a truncated BPTT algorithm, i.e., the one-step gradient, to back-propagate the gradient effectively. We illustrate the advantages of Hamburger on fundamental vision tasks where global information has been proven crucial, including semantic segmentation and image generation. The experiments show that the optimization-designed Hamburger performs competitively with state-of-the-art attention models once the unstable gradient back-propagated through the iterative computational graph of MD is avoided. Hamburger sets new state-of-the-art records on the PASCAL VOC dataset (Everingham et al., 2010) and the PASCAL Context dataset (Mottaghi et al., 2014) for semantic segmentation and surpasses existing attention modules for GANs in large-scale image generation on ImageNet (Deng et al., 2009). The contributions of this paper are listed as follows:
• We show a white-box approach to designing global information blocks, i.e., by turning the optimization algorithm that minimizes an objective function, in which modeling the global correlation is formulated as a low-rank completion problem, into the architecture.

2.1. WARM UP

Since matrix decomposition is pivotal to the proposed Hamburger, we first review its idea. A common view is that matrix decomposition factorizes the observed matrix into a product of several sub-matrices, e.g., Singular Value Decomposition. However, a more illuminating perspective is that, by assuming a generation process, matrix decomposition acts as the inverse of generation, disassembling the atoms that make up the complex data. By reconstructing the original matrix, matrix decomposition recovers the latent structure of the observed data. Suppose that the given data are arranged as the columns of a large matrix $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$. A general assumption is that there is a low-dimensional subspace, or a union of multiple subspaces, hidden in $X$. That is, there exist a dictionary matrix $D = [d_1, \cdots, d_r] \in \mathbb{R}^{d \times r}$ and corresponding codes $C = [c_1, \cdots, c_n] \in \mathbb{R}^{r \times n}$ such that $X$ can be expressed as

$$X = \bar{X} + E = DC + E, \tag{1}$$

where $\bar{X} \in \mathbb{R}^{d \times n}$ is the output low-rank reconstruction and $E \in \mathbb{R}^{d \times n}$ is the noise matrix to be discarded; reading Eq. (1) from right to left describes generation, while reading it from left to right describes decomposition. Here we assume that the recovered matrix $\bar{X}$ has the low-rank property, such that

$$\operatorname{rank}(\bar{X}) \le \min(\operatorname{rank}(D), \operatorname{rank}(C)) \le r \ll \min(d, n). \tag{2}$$

Different MDs can be derived by imposing structures on the matrices $D$, $C$, and $E$ (Kolda & Bader, 2009; Udell et al., 2016). MD is usually formulated as an objective with various constraints and then solved by optimization algorithms, with classic applications to image denoising (Wright et al., 2009; Lu et al., 2014), inpainting (Mairal et al., 2010), and feature extraction (Zhang et al., 2012).
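As a concrete illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch that recovers a low-rank reconstruction $\bar{X} = DC$ from noisy non-negative data using NMF with the classic Lee–Seung multiplicative updates; the sizes, iteration count, and variable names are illustrative choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 64, 256, 8

# Synthetic non-negative data with a hidden rank-r structure plus noise: X = X_bar + E.
D_true = rng.random((d, r))
C_true = rng.random((r, n))
X = D_true @ C_true + 0.01 * rng.random((d, n))

# NMF: minimize ||X - D C||_F^2 subject to D, C >= 0 via multiplicative updates.
D = rng.random((d, r))
C = rng.random((r, n))
eps = 1e-8
init_err = np.linalg.norm(X - D @ C)
for _ in range(200):
    C *= (D.T @ X) / (D.T @ D @ C + eps)   # update codes
    D *= (X @ C.T) / (D @ C @ C.T + eps)   # update dictionary
final_err = np.linalg.norm(X - D @ C)

X_bar = D @ C  # low-rank reconstruction; rank(X_bar) <= r << min(d, n)
```

The multiplicative updates keep $D$ and $C$ non-negative and monotonically decrease the reconstruction error, which is why the iteration itself can serve as a computational graph.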

2.2. PROPOSED METHOD

We focus on building global context modules for networks without painstaking hand-crafted design. Before starting our discussion, we briefly review the representative hand-designed context block, self-attention. The attention mechanism aims to find a group of concepts for further conscious reasoning from a massive unconscious context (Xu et al., 2015; Bengio, 2017; Goyal et al., 2019). As a representative, self-attention (Vaswani et al., 2017) was proposed for learning long-range dependencies in machine translation,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d$ is the dimension of the queries and keys.
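For reference, the formula above can be sketched in a few lines of NumPy (single head, no learned projections; shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n, n) attention weights, rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
n, d = 16, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)  # (n, d)
```

Note that forming the $n \times n$ attention matrix costs $O(n^2)$ memory and time in the sequence length, which is the cost an MD-based alternative seeks to avoid.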



• We propose Hamburger, a light yet powerful global context module with O(n) complexity, surpassing various attention modules on semantic segmentation and image generation.
• We identify the main obstacle to applying MD in networks as the unstable backward gradient through its iterative optimization algorithm. As a pragmatic solution, the proposed one-step gradient facilitates the training of Hamburger with MDs.
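The idea behind the one-step gradient can be sketched on a toy fixed-point iteration (a simplified illustration of the principle, not the paper's exact procedure): run the forward iterations without tracking gradients, then differentiate only the final update step. In general this drops the backward chain through earlier iterates and is therefore an approximation; for the Babylonian square-root iteration below, a Newton-style method whose update becomes insensitive to its input at the fixed point, it happens to coincide with the exact gradient.

```python
import numpy as np

def sqrt_iter(a, x):
    """One step of the Babylonian iteration: x_{t+1} = 0.5 * (x_t + a / x_t)."""
    return 0.5 * (x + a / x)

a = 2.0
x = a
for _ in range(20):        # "no-grad" forward iterations, treated as constants
    x = sqrt_iter(a, x)    # x converges to sqrt(a)

# One-step gradient: differentiate only the last update w.r.t. a,
# holding the previous iterate x fixed: d/da [0.5 * (x + a / x)] = 0.5 / x.
one_step_grad = 0.5 / x

exact_grad = 0.5 / np.sqrt(a)  # true derivative: d sqrt(a) / da
```

In Hamburger, the same trick detaches the MD iterations from the autograd graph and back-propagates through only the last optimization step, avoiding the unstable gradients of full BPTT through the iterative solver.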

