CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER ATTENTION

Abstract

More and more evidence has shown that strengthening layer interactions can enhance the representation power of a deep neural network, while self-attention excels at learning interdependencies by retrieving query-activated information. Motivated by this, we devise a cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), that sends a query representation of the current layer to all previous layers to retrieve query-related information from different levels of receptive fields. A light-weighted version of MRLA is also proposed to reduce the quadratic computation cost. The proposed layer attention mechanism can enrich the representation power of many state-of-the-art vision networks, including CNNs and vision transformers. Its effectiveness has been extensively evaluated in image classification, object detection and instance segmentation tasks, where consistent improvements are observed. For example, our MRLA improves the Top-1 accuracy of ResNet-50 by 1.6%, while introducing only 0.16M parameters and 0.07B FLOPs. Surprisingly, it boosts performance by a large margin of 3-4% box AP and mask AP in dense prediction tasks. Our code is available at https://github.com/joyfang1106/MRLA.

1. INTRODUCTION

Growing evidence indicates that strengthening layer interactions can encourage the information flow of a deep neural network (He et al., 2016; Huang et al., 2017; Zhao et al., 2021). For example, in vision networks, the receptive fields are usually enlarged as layers are stacked. These hierarchical receptive fields play different roles in extracting features: local texture features are captured by small receptive fields, while global semantic features are captured by large receptive fields. Hence, encouraging layer interactions can enhance the representation power of networks by combining different levels of features. Previous empirical studies also support the necessity of building interdependencies across layers. ResNet (He et al., 2016) proposed to add a skip connection between two consecutive layers. DenseNet (Huang et al., 2017) further reinforced layer interactions by making each layer accessible to all subsequent layers within a stage. Recently, GLOM (Hinton, 2021) adopted an intensely interacted architecture that includes bottom-up, top-down, and same-level interactions, attempting to represent part-whole hierarchies in a neural network.

In the meantime, the attention mechanism has proven itself in learning interdependencies by retrieving query-activated information in deep neural networks. Current works on attention lay much emphasis on amplifying interactions within a layer (Hu et al., 2018; Woo et al., 2018; Dosovitskiy et al., 2021). They implement attention on channels, spatial locations, and patches; however, none of them consider attention on layers, which are actually the higher-level features of a network. It is then natural to ask: "Can attention replicate its success in strengthening layer interactions?" This paper gives a positive answer. Specifically, starting from the vanilla attention, we first give a formal definition of layer attention.
Under this definition, a query representation of the current layer is sent to all previous layers to retrieve related information from hierarchical receptive fields. The resulting attention scores concretely depict the cross-layer dependencies, which also quantify the importance of hierarchical information to the query layer. Furthermore, utilizing the sequential structure of networks, we suggest a way to perform layer attention recurrently in Section 3.3 and call it recurrent layer attention (RLA). A multi-head design is naturally introduced to diversify representation subspaces, and hence comes multi-head RLA (MRLA). Figure 1(a) visualizes the layer attention scores yielded by MRLA at Eq. (6). Interestingly, most layers pay more attention to the first layer within the stage, verifying our motivation for retrospectively retrieving information. Inheriting from the vanilla attention, MRLA has a quadratic complexity of $O(T^2)$, where $T$ is the depth of a network. When applied to very deep networks, this incurs a high computation cost and possibly an out-of-memory problem. To mitigate these issues, this paper devises a light-weighted version of MRLA with linear complexity $O(T)$. After imposing a linearized approximation, MRLA becomes more efficient and has a broader sphere of applications. To the best of our knowledge, our work is the first attempt to systematically study cross-layer dependencies via attention. It differs from the information aggregation in DenseNet because the latter aggregates all previous layers' features in a channel-wise way regardless of which layer a feature comes from. OmniNet (Tay et al., 2021) and ACLA (Wang et al., 2022b) follow the same spirit as DenseNet: they allow each token from each layer to attend to all tokens from all previous layers. Essentially, both of them neglect the layer identity of each token.
By contrast, we rely on layer attention to retrospectively retrieve query-related features from previous layers. Besides, to bypass the high computation cost, OmniNet divides the network into several partitions and inserts the omnidirectional attention block only after the last layer of each partition, while ACLA samples tokens with gates from each layer. Instead, our light-weighted version of MRLA can be easily applied to each layer. The two versions of MRLA can improve many state-of-the-art (SOTA) vision networks, such as convolutional neural networks (CNNs) and vision transformers. We have conducted extensive experiments across various tasks, including image classification, object detection and instance segmentation. The experimental results show that our MRLA performs favorably against its counterparts. Especially in dense prediction tasks, it outperforms other SOTA networks by a large margin. The visualizations (see Appendix B.5) show that our MRLA can retrieve local texture features with positional information from previous layers, which may account for its remarkable success in dense prediction. The main contributions of this paper are summarized below: (1) A novel layer attention mechanism, MRLA, is proposed to strengthen cross-layer interactions by retrieving query-related information from previous layers. (2) A light-weighted version of MRLA with linear complexity is further devised to make cross-layer attention feasible for deeper networks. (3) We show that MRLA is compatible with many networks, and validate its effectiveness across a broad range of tasks on benchmark datasets. (4) We investigate the important design elements of our MRLA block through an ablation study and provide guidelines for its application to convolutional and transformer-based vision models.

2. RELATED WORK

Layer Interaction. Apart from the works mentioned above, other CNN-based and transformer-based models also put much effort into strengthening layer interactions. DIANet (Huang et al., 2020) utilized a parameter-sharing LSTM along the network depth to model cross-channel relationships with the help of previous layers' information. CN-CNN (Guo et al., 2022) combined DIANet's LSTM with spatial and channel attention for feature fusion across layers. A similar RNN module was applied along the network depth to recurrently aggregate layer-wise information in RLANet (Zhao et al., 2021). To distinguish it from our RLA, we rename the former as RLA_g in the following. RealFormer (He et al., 2021) and EA-Transformer (Wang et al., 2021) both added the attention scores of the previous layer to the current one, connecting the layers by residual attention. Bapna et al. (2018) modified the encoder-decoder layers by letting the decoders attend to all encoder layers. However, maintaining the features from all encoders suffers from a high memory cost, especially for high-dimensional features.

Attention Mechanism in CNNs. CNNs have dominated vision tasks by serving as backbone networks in the past decades. Recently, attention mechanisms have been incorporated into CNNs with large receptive fields to capture long-range dependencies. SENet (Hu et al., 2018) and ECANet (Wang et al., 2020b) are two typical channel attention modules, which adaptively recalibrate channel-wise features by modelling cross-channel dependencies. The Squeeze-and-Excitation (SE) block in SENet was later employed by the architectures of MobileNetV3 (Howard et al., 2019) and EfficientNet (Tan & Le, 2019). CBAM (Woo et al., 2018) first combined channel and spatial attention to emphasize meaningful features along the two principal dimensions. Pixel-level pairwise interactions across all spatial positions were captured by the non-local (NL) block in NLNet (Wang et al., 2018).
GCNet (Cao et al., 2019) simplified the query-specific operation in the NL block to a query-independent one while maintaining the performance. CANet (Li et al., 2021) also extended the NL block for small object detection and integrated different layers' features by resizing and averaging. TDAM (Jaiswal et al., 2022) and BANet (Zhao et al., 2022) both perform joint attention on low- and high-level feature maps within a convolution block, which we consider inner-layer attention in this paper.

Transformer-based Vision Networks. Motivated by the success of the Transformer (Vaswani et al., 2017) in Natural Language Processing (NLP), many researchers have applied transformer-based architectures to vision domains. The first is ViT (Dosovitskiy et al., 2021), which adapts a standard convolution-free Transformer to image classification by embedding an image into a sequence of patches. However, it relies on large-scale pre-training to perform comparably with SOTA CNNs. This issue can be mitigated by introducing an inductive bias that the original Transformer lacks. DeiT (Touvron et al., 2021) adopted a knowledge distillation procedure to learn the inductive bias from a CNN teacher model. As a result, it only needs to be trained on a middle-sized dataset, ImageNet-1K, and achieves results competitive with CNNs. CeiT (Yuan et al., 2021) introduced convolutions to the patch embedding process and the feed-forward network of ViT. Swin Transformer (Liu et al., 2021) employed a hierarchical design and a shifted-window strategy to imitate a CNN-based model.

3. LAYER ATTENTION AND RECURRENT LAYER ATTENTION

This section first recalls the mathematical formulation of self-attention. Then, it gives the definition of layer attention, and formulates the recurrent layer attention as well as its multi-head version.

3.1. REVISITING ATTENTION

Let $X \in \mathbb{R}^{T \times D_{in}}$ be an input matrix consisting of $T$ tokens with $D_{in}$ dimensions each, and we consider a self-attention with an output matrix $O \in \mathbb{R}^{T \times D_{out}}$. While in NLP each token corresponds to a word in a sentence, the same formalism can be applied to any sequence of $T$ discrete objects, e.g., pixels and feature maps. The self-attention mechanism first derives the query, key and value matrices $Q$, $K$ and $V$ by projecting $X$ with linear transformations, i.e., $Q = XW_Q$ with $W_Q \in \mathbb{R}^{D_{in} \times D_k}$, $K = XW_K$ with $W_K \in \mathbb{R}^{D_{in} \times D_k}$, and $V = XW_V$ with $W_V \in \mathbb{R}^{D_{in} \times D_{out}}$. Then, the output is given by
$$O = \text{Self-Attention}(X) := \text{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V = AV,$$
where $A = (a_{i,j})$ is a $T \times T$ matrix. Here we adopt NumPy-like notations: for a matrix $Y \in \mathbb{R}^{I \times J}$, $Y_{i,:}$, $Y_{:,j}$, and $y_{i,j}$ are its $i$-th row, $j$-th column, and $(i,j)$-th element, respectively. Moreover, $[T]$ refers to the set of indices 1 to $T$. Then, a self-attention mechanism mapping any query token $t \in [T]$ from $D_{in}$ to $D_{out}$ dimensions can be formulated in an additive form:
$$O_{t,:} = A_{t,:}V = \sum_{s=1}^{T} a_{t,s} V_{s,:}. \quad (1)$$
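The formulation above can be sketched in a few lines of NumPy; the shapes and random weights below are illustrative stand-ins for the learned projections, and the check at the end confirms that the matrix form and the additive form of Eq. (1) coincide.

```python
import numpy as np

# Minimal self-attention sketch: T tokens of dimension D_in are projected to
# queries, keys and values; each output row is O[t] = sum_s a[t, s] * V[s].
rng = np.random.default_rng(0)
T, D_in, D_k, D_out = 5, 8, 4, 6

X = rng.standard_normal((T, D_in))
W_Q = rng.standard_normal((D_in, D_k))
W_K = rng.standard_normal((D_in, D_k))
W_V = rng.standard_normal((D_in, D_out))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(D_k)                 # (T, T) scaled dot products
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
O = A @ V                                        # (T, D_out)

# Additive form of Eq. (1) for a single query token t
t = 2
O_t = sum(A[t, s] * V[s] for s in range(T))
assert np.allclose(O_t, O[t])
```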

3.2. LAYER ATTENTION

For a deep neural network, let $X^t \in \mathbb{R}^{1 \times D}$ be the output feature of its $t$-th layer, where $t \in [T]$ and $T$ is the number of layers. We consider an attention mechanism with $X^t$ attending to all previous layers and itself, i.e., the input matrix is $(X^1, ..., X^t) \in \mathbb{R}^{t \times D}$, and each $X^s$ is treated as a token. Assuming $D_{in} = D_{out} = D$, we first derive the query, key and value for the $t$-th layer attention below:
$$Q^t = f_Q^t(X^t) \in \mathbb{R}^{1 \times D_k}, \quad (2)$$
$$K^t = \text{Concat}[f_K^t(X^1), ..., f_K^t(X^t)] \quad \text{and} \quad V^t = \text{Concat}[f_V^t(X^1), ..., f_V^t(X^t)], \quad (3)$$
where $K^t \in \mathbb{R}^{t \times D_k}$ and $V^t \in \mathbb{R}^{t \times D}$. Here, $f_Q^t$ denotes a function to extract the current layer's information; $f_K^t$ and $f_V^t$ are functions that extract information from the 1st to $t$-th layers' features. Denote the output of the $t$-th layer attention by $O^t \in \mathbb{R}^{1 \times D}$, and then the layer attention with the $t$-th layer as the query is defined as
$$O^t = Q^t (K^t)^\top V^t = \sum_{s=1}^{t} Q^t (K^t_{s,:})^\top V^t_{s,:}, \quad (4)$$
which has a similar additive form as Eq. (1). The softmax and the scale factor $\sqrt{D_k}$ are omitted here for clarity since the normalization can be performed easily in practice. The proposed layer attention first depicts the dependencies between the $t$-th and $s$-th layers with the attention score $Q^t (K^t_{s,:})^\top$, and then uses it to reweight the transformed layer feature $V^t_{s,:}$.
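The definition above treats each layer's feature as a single token. A hedged NumPy sketch (random matrices stand in for $f_Q$, $f_K$, $f_V$; no claim about the paper's actual transformations) shows that each attention score $Q^t (K^t_{s,:})^\top$ is a scalar weighting one previous layer:

```python
import numpy as np

# Layer attention sketch: the query of layer t attends to summaries of
# layers 1..t; each score is a scalar per layer, not per spatial position.
rng = np.random.default_rng(1)
t, D, D_k = 4, 6, 3

X = rng.standard_normal((t, D))             # X[s] is layer (s+1)'s output feature
W_Q = rng.standard_normal((D, D_k))         # illustrative stand-ins for f_Q, f_K, f_V
W_K = rng.standard_normal((D, D_k))
W_V = rng.standard_normal((D, D))

Q_t = X[-1:] @ W_Q                          # (1, D_k): query from the current layer
K_t = X @ W_K                               # (t, D_k): keys of layers 1..t
V_t = X @ W_V                               # (t, D):   values of layers 1..t

# Matrix form and additive form of Eq. (4) coincide
O_t = Q_t @ K_t.T @ V_t                     # (1, D)
O_add = sum((Q_t @ K_t[s:s+1].T) * V_t[s:s+1] for s in range(t))
assert np.allclose(O_t, O_add)
```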

3.3. RECURRENT LAYER ATTENTION

In Eq. (3), the computation cost associated with $f_K^t$ and $f_V^t$ increases with $t$. Taking advantage of the sequential structure of layers, we make a natural simplification (see Appendix A.2 for details):
$$K^t = \text{Concat}[f_K^1(X^1), ..., f_K^t(X^t)] \quad \text{and} \quad V^t = \text{Concat}[f_V^1(X^1), ..., f_V^t(X^t)]. \quad (5)$$
The simplification allows the key and value matrices of the $t$-th layer to inherit from the preceding ones, i.e., $K^t = \text{Concat}[K^{t-1}, f_K^t(X^t)]$ and $V^t = \text{Concat}[V^{t-1}, f_V^t(X^t)]$, which avoids the redundancy induced by repeatedly deriving the keys and values for the same layer with different transformation functions. Based on this simplification, we can rewrite Eq. (4) into
$$O^t = \sum_{s=1}^{t-1} Q^t (K^{t-1}_{s,:})^\top V^{t-1}_{s,:} + Q^t (K^t_{t,:})^\top V^t_{t,:} = Q^t (K^{t-1})^\top V^{t-1} + Q^t (K^t_{t,:})^\top V^t_{t,:}, \quad (6)$$
where the first term corresponds to the attention with the $t$-th layer sending the query to all previous layers. Compared to Eq. (4), $K^t$ and $V^t$ are now constructed in a recurrent way, i.e., $K^{t-1}$ and $V^{t-1}$ are reused. We therefore call Eq. (6) the recurrent layer attention (RLA), which significantly saves the computation cost. A schematic diagram is provided in Figure 1(b). Nonetheless, RLA suffers from a quadratic complexity in the network depth since the first term in Eq. (6) still needs to be recomputed for each layer. Besides, for an image, the feature dimension $D$ of $V^t$ is equal to $H \times W \times C$. For a very deep network with large $t$, the expanding size of $V^t$ will result in an out-of-memory problem. Both issues limit the layer attention to a shallow or narrow network. Luckily, several techniques are available for linearizing the complexity of self-attention for the tokens within a layer. If workable at the layer level, they can be the tool that enables layer attention to adapt to very deep networks under training time and memory constraints.
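The decomposition in Eq. (6) can be checked numerically. The sketch below (illustrative random weights, not the paper's learned functions) builds $K^t, V^t$ recurrently from $K^{t-1}, V^{t-1}$ and verifies that the full attention splits into a history term plus a current-layer term:

```python
import numpy as np

# Recurrence of Eq. (6): reusing K^{t-1} and V^{t-1}, the t-th layer's output
# is the sum of a "history" term and a "current layer" term.
rng = np.random.default_rng(2)
t, D, D_k = 5, 6, 3
X = rng.standard_normal((t, D))
W_Q, W_K, W_V = (rng.standard_normal((D, d)) for d in (D_k, D_k, D))

K_prev, V_prev = X[:-1] @ W_K, X[:-1] @ W_V     # inherited from layer t-1
k_t, v_t = X[-1:] @ W_K, X[-1:] @ W_V           # current layer's key/value
Q_t = X[-1:] @ W_Q

K_t = np.concatenate([K_prev, k_t])             # recurrent construction
V_t = np.concatenate([V_prev, v_t])

O_full = Q_t @ K_t.T @ V_t                      # Eq. (4) over all t layers
O_rec = Q_t @ K_prev.T @ V_prev + (Q_t @ k_t.T) * v_t   # Eq. (6)
assert np.allclose(O_full, O_rec)
```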
We provide an example of linear RLA with the method proposed by Katharopoulos et al. (2020) in Appendix A.3. Here we suggest another tailor-made method that utilizes an approximation of $Q^t (K^{t-1})^\top V^{t-1}$ to linearize RLA. Denote the element-wise product by $\odot$; then there exists a $\lambda_q^t \in \mathbb{R}^{1 \times D_k}$ such that $Q^t = \lambda_q^t \odot Q^{t-1}$. We speculate that the query vectors at two consecutive layers have a similar pattern, i.e., $Q^t$ is roughly proportional to $Q^{t-1}$, or equivalently the elements of $\lambda_q^t$ have similar values (see Figure 2(b) for empirical support). Consequently,
$$Q^t (K^{t-1})^\top V^{t-1} = (\lambda_q^t \odot Q^{t-1})(K^{t-1})^\top V^{t-1} \approx \lambda_o^t \odot [Q^{t-1}(K^{t-1})^\top V^{t-1}] = \lambda_o^t \odot O^{t-1}, \quad (7)$$
where $\lambda_o^t \in \mathbb{R}^{1 \times D}$ intrinsically depends on $Q^t$, $Q^{t-1}$ and $(K^{t-1})^\top V^{t-1}$, and we set $\lambda_o^t$ as a learnable vector since its computation is complicated (see Appendix A.2 for a detailed explanation of the approximation). The learnable vector can adaptively bridge the gap between $Q^t$ and $Q^{t-1}$. Note that the approximation in Eq. (7) becomes an equality when the elements of $\lambda_q^t$ are the same, i.e., $\lambda_q^t = c\mathbf{1} \in \mathbb{R}^{1 \times D_k}$, and then $\lambda_o^t = c\mathbf{1} \in \mathbb{R}^{1 \times D}$. Moreover, the approximation error can be alleviated by the multi-head design in the next subsection. With Eq. (7), an efficient and light-weighted RLA version with complexity $O(T)$ is suggested below:
$$O^t = \lambda_o^t \odot O^{t-1} + Q^t (K^t_{t,:})^\top V^t_{t,:} = \sum_{l=0}^{t-1} \beta_l \odot Q^{t-l} (K^{t-l}_{t-l,:})^\top V^{t-l}_{t-l,:}, \quad (8)$$
where $\beta_0 = \mathbf{1}$, and $\beta_l = \lambda_o^t \odot \cdots \odot \lambda_o^{t-l+1}$ for $l \geq 1$. From the second equality, the extended RLA version is indeed a weighted average of past layers' features $V^{t-l}_{t-l,:}$. This is consistent with the broad definition of attention that many other attention mechanisms in computer vision adopt (Hu et al., 2018; Woo et al., 2018; Wang et al., 2020b). It is also interesting to observe that the first equality admits the form of a generalized residual connection, leading to an easier implementation in practice.
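The two forms in Eq. (8) can be verified against each other with scalar per-layer scores and random $\lambda_o$ vectors (illustrative values only; in the model $\lambda_o$ is learned):

```python
import numpy as np

# MRLA-light sketch: the O(T) recurrence O^t = lam ⊙ O^{t-1} + score_t * v_t
# unrolls into the weighted sum with beta_l = lam^t ⊙ ... ⊙ lam^{t-l+1}.
rng = np.random.default_rng(3)
T, D = 6, 5
lam = rng.uniform(0.5, 1.5, size=(T, D))    # stand-in for learnable lambda_o per layer
scores = rng.standard_normal(T)             # scalar Q^t (K^t_{t,:})^T per layer
values = rng.standard_normal((T, D))        # V^t_{t,:} per layer

# Recurrent form: one update per layer
O = np.zeros(D)
for t in range(T):
    O = lam[t] * O + scores[t] * values[t]

# Unrolled weighted-sum form evaluated at the last layer
O_unrolled = np.zeros(D)
beta = np.ones(D)                           # beta_0 = 1
for l in range(T):                          # l = 0 is the current (last) layer
    O_unrolled += beta * scores[T - 1 - l] * values[T - 1 - l]
    beta = beta * lam[T - 1 - l]            # accumulate lam^t ⊙ ... ⊙ lam^{t-l+1}
assert np.allclose(O, O_unrolled)
```

The recurrent form is what makes the implementation resemble a generalized residual connection: each block only stores the previous output $O^{t-1}$.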

3.4. MULTI-HEAD RECURRENT LAYER ATTENTION

Motivated by the multi-head self-attention (MHSA) in the Transformer, this section proposes a multi-head recurrent layer attention (MRLA) to allow information from diverse representation subspaces. We split the terms in Eq. (6) and Eq. (8) into $H$ heads. Then, for head $h \in [H]$, RLA and its light-weighted version can be reformulated as
$$O^t_h = Q^t_h (K^t_h)^\top V^t_h \quad \text{and} \quad O^t_h \approx \lambda^t_{o,h} \odot O^{t-1}_h + Q^t_h (K^t_{h[t,:]})^\top V^t_{h[t,:]},$$
respectively, where $O^t_h \in \mathbb{R}^{1 \times D/H}$ is the $h$-th head's output of MRLA. The final outputs are obtained by concatenation, i.e., $O^t = \text{Concat}[O^t_1, ..., O^t_H]$. For convenience, we dub the MRLA and its light-weighted version MRLA-base and MRLA-light, which are collectively referred to as MRLA. In contrast to Eq. (6) and Eq. (8), where a 3D image feature $V^t_{t,:}$ is weighted by a scalar $Q^t (K^t_{t,:})^\top$, the multi-head versions allow each layer's features to be adjusted by an enriched vector of length $H$, strengthening the representation power of the output features. In addition, for MRLA-light, since the approximation in Eq. (7) is now conducted within each head, i.e., $Q^t_h (K^{t-1}_h)^\top V^{t-1}_h \approx \lambda^t_{o,h} \odot [Q^{t-1}_h (K^{t-1}_h)^\top V^{t-1}_h]$, it becomes an equality as long as $\lambda^t_{q,h} = c_h \mathbf{1} \in \mathbb{R}^{1 \times D_k/H}$, which is a much more relaxed requirement. In particular, when there are $D_k$ heads, i.e., $H = D_k$, the approximation in Eq. (7) always holds. If layer attention is applied to small networks, or the minimum computation, time and memory costs are not chased, we recommend both MRLA-base and MRLA-light as they can equivalently retrieve useful information from previous layers. In case one has to consider the training time and memory footprint, MRLA-light is preferred as an efficient version that we adapt for deeper networks.

4. APPLICATIONS OF MRLA IN VISION NETWORKS

Recent networks are formed by deep stacks of similar blocks (layers), and therefore an MRLA-base/light block can be inserted right after a building block of a network.

CNNs. Figure 2(a) illustrates the detailed block design of MRLA-light in CNNs (see that of MRLA-base in Figure 4 of the Appendix). Given the output of the $t$-th CNN block $X^t \in \mathbb{R}^{1 \times D}$, where $D = H \times W \times C$, we perform a global average pooling (GAP) to summarize the $t$-th layer's information. Then two 1D convolutions (Conv1D) are used to extract the query $Q^t$ and the key $K^t_{t,:}$, whose kernel sizes are adaptively determined by the strategy in Wang et al. (2020b). A 3x3 depth-wise convolution (DWConv) is applied to directly obtain $V^t_{t,:}$. Here, the two Conv1Ds (together with GAP) and the DWConv correspond to $f^t_Q$, $f^t_K$ and $f^t_V$, respectively. We then divide the query, key and value into $H$ heads along the channels. The output of the previous MRLA block, $O^{t-1}$, is partitioned similarly. We set $\lambda^t_o$ to size $C$ instead of $H \times W \times C$ in order to introduce fewer parameters, and it performs an element-wise product with $O^{t-1}$ after being expanded. A sigmoid activation is applied to the projection of query and key to scale the attention weights into [0, 1].

Vision Transformers. Most of the block design is the same as in Figure 2(a) except for the following changes. The output of the $t$-th block of a vision transformer is $X^t \in \mathbb{R}^{1 \times D}$, where $D = (N+1) \times C$, $N$ is the number of patches and $C$ is the embedding dimension. We first split $X^t$ into patch tokens $X^t_p \in \mathbb{R}^{N \times C}$ and a class token $X^t_c \in \mathbb{R}^{1 \times C}$. Then the patch tokens, which preserve spatial information, are reshaped into $X^t_p \in \mathbb{R}^{\sqrt{N} \times \sqrt{N} \times C}$ as the input of our MRLA. Only the patch tokens of the last MRLA output, $O^{t-1}_p$, are brought into the MRLA block. Lastly, the patch tokens are reshaped to the initial dimension and concatenated with the class token as the next layer's input.
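The CNN block design above can be sketched in NumPy. This is a deliberately simplified stand-in, not the paper's implementation: fixed Conv1D kernel sizes replace the adaptive rule of Wang et al. (2020b), and a per-channel scaling replaces the 3x3 DWConv value transform; the head-wise recurrence and sigmoid-scaled scalar weights follow the description above.

```python
import numpy as np

def conv1d_same(x, w):
    # 1D convolution across the channel axis with zero padding ('same' length).
    k = len(w); pad = k // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + k] @ w for i in range(len(x))])

def mrla_light_block(X_t, O_prev, w_q, w_k, W_v, lam_o, H):
    """Simplified MRLA-light block for CNNs (illustrative sketch):
    GAP summarises X_t; two Conv1Ds give the query/key over the C channels;
    a per-channel scaling W_v stands in for the 3x3 DWConv value; the
    recurrence O^t = lam_o ⊙ O^{t-1} + a_h * V^t is applied per head."""
    _, _, C = X_t.shape
    y = X_t.mean(axis=(0, 1))                   # GAP: (C,)
    q = conv1d_same(y, w_q)                     # query, (C,)
    k = conv1d_same(y, w_k)                     # key,   (C,)
    V = X_t * W_v                               # value: depthwise stand-in
    O = np.empty_like(V)
    d = C // H
    for h in range(H):
        sl = slice(h * d, (h + 1) * d)
        a = 1.0 / (1.0 + np.exp(-(q[sl] @ k[sl])))   # sigmoid-scaled scalar weight
        O[..., sl] = lam_o[sl] * O_prev[..., sl] + a * V[..., sl]
    return O

rng = np.random.default_rng(5)
Hh, Ww, C, H = 4, 4, 8, 2
X_t = rng.standard_normal((Hh, Ww, C))
O_prev = rng.standard_normal((Hh, Ww, C))
out = mrla_light_block(X_t, O_prev, rng.standard_normal(3),
                       rng.standard_normal(3), rng.standard_normal(C),
                       np.ones(C), H)
assert out.shape == (Hh, Ww, C)
```

Note that $\lambda_o$ has length $C$ and broadcasts over the spatial dimensions, mirroring the parameter-saving choice described above.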

5. EXPERIMENTS

This section first evaluates our MRLAs by conducting experiments on image classification, object detection and instance segmentation. Then the MRLA-light block is taken as an example to ablate its important design elements. All models are implemented with the PyTorch toolkit on 4 V100 GPUs. More implementation details, results, comparisons and visualizations are provided in Appendix B.

5.1. IMAGENET CLASSIFICATION

We use the middle-sized ImageNet-1K dataset (Deng et al., 2009) directly. Our MRLAs are applied to the widely used ResNet (He et al., 2016) and the current SOTA EfficientNet (Tan & Le, 2019), which are two general ConvNet families. For vision transformers, DeiT (Touvron et al., 2021), CeiT (Yuan et al., 2021) and PVTv2 (Wang et al., 2022a) are considered. We compare our MRLAs with baselines and several SOTA attention methods using ResNet as a baseline model.

Settings. A hyperparameter $d_k$ is introduced to control the number of channels per MRLA head. We set $d_k$ = 32, 8 and 16 for ResNet, EfficientNet and vision transformers, respectively. To train these vision networks with our MRLAs, we follow the same data augmentation and training strategies as in their original papers (He et al., 2016; Tan & Le, 2019; Touvron et al., 2021; Wang et al., 2022a).

Results

The performances of the different-sized SOTA networks with our MRLAs are reported in Table 1. For a fair comparison, the results of ResNets from torchvision are replicated. We first observe that MRLA-base and MRLA-light have comparable performances when added to relatively small networks, verifying that the approximation in MRLA-light does not sacrifice much accuracy. The out-of-memory problem occurs when MRLA-base is applied to ResNet-101 with the same batch size. Therefore, MRLA-light is recommended for deeper networks when efficiency and computational resources are taken into account, and it can perform slightly better, probably because of the additional flexible learnable vector. We next compare our MRLAs with other attention methods using ResNets as baselines. Results show that our MRLAs are superior to SENet, CBAM, $A^2$-Net, one NL, and ECA-Net. Especially among layer-interaction-related networks, our MRLAs outperform DIANet and RLA_g-Net, all of which beat the DenseNet of similar model size. MRLAs are also as competitive as AA-Net and GCNet with lower model complexity or fewer parameters. For EfficientNets, MRLAs introduce only about 0.01M and 0.02M more parameters, leading to 1.3% and 1.1% increases in Top-1 accuracy for EfficientNet-B0/B1, respectively. It is worth noting that the architecture of EfficientNets was obtained via a thorough neural architecture search, which is hard to improve further. Consistent improvements are also observed in the transformer-based models. Specifically, our MRLAs achieve 1.2% and 1.0% gains in Top-1 accuracy on DeiT-T and CeiT-T, while both introduce +0.03M parameters, and MRLA-light only increases FLOPs by +0.04B. With CeiT-T, we also validate that the FLOPs induced by our MRLA are nearly linear in the input resolution (see Appendix B.1.2). We additionally supplement a fair comparison with BANet (Zhao et al., 2022) in Appendix B.
Table 1 (excerpt): attention methods on the ResNet-50 baseline (Params / FLOPs / Top-1 / Top-5):
+ SE (Hu et al., 2018): 28.1 M / 4.1 B / 76.7 / 93.4
+ CBAM (Woo et al., 2018): 28.1 M / 4.2 B / 77.3 / 93.7
+ $A^2$ (Chen et al., 2018): 34.6 M / 7.0 B / 77.0 / 93.5
+ AA (Bello et al., 2019): 27.1 M / 4.5 B / 77.7 / 93.8
+ 1 NL (Wang et al., 2018): 29.0 M / 4.4 B / 77.2 / 93.5
+ 1 GC (Cao et al., 2019): 26.9 M / 4.1 B / 77.3 / 93.5
+ all GC (Cao et al., 2019): 29.4 M / 4.2 B / 77.7 / 93.7
+ ECA (Wang et al., 2020b): 25.6 M / 4.1 B / 77.5 / 93.7
+ DIA (Huang et al., 2020): 28.4 M / - / 77.2 / -
+ RLA_g (Zhao et al., 2021): 25.9 M / 4.

Assumption Validation. To validate the assumption made in Eq. (7), that $Q^t_h$ is roughly proportional to $Q^{t-1}_h$ within each head, we compute the absolute value of the cosine similarity between them and visualize the histogram in Figure 2(b). Note that if the absolute cosine similarity approaches 1, the desired property that the elements of $\lambda^t_{q,h}$ have similar values is reached. The validation is conducted by randomly sampling 5 images from each class of the ImageNet validation set and then classifying these images with the trained ResNet-50+MRLA-base model. The query vectors from each head of all MRLA-base blocks are extracted, except for those belonging to the first layer within each stage, as the first layer only attends to itself.

Published as a conference paper at ICLR 2023

5.2. OBJECT DETECTION AND INSTANCE SEGMENTATION

This subsection validates the transferability and generalization ability of our model in object detection and instance segmentation tasks using three typical object detection frameworks: Faster R-CNN (Ren et al., 2015), RetinaNet (Lin et al., 2017) and Mask R-CNN (He et al., 2017).

Settings. All experiments are conducted on MS COCO 2017 (Lin et al., 2014), which contains 118K training, 5K validation and 20K test-dev images. All detectors are implemented with the open-source MMDetection toolkit (Chen et al., 2019), where the commonly used settings and the 1x training schedule (Hu et al., 2018; Wang et al., 2020b; Zhao et al., 2021) are adopted.

Results on Object Detection
… ResNet-101 by 4.0% and 3.5%, respectively. The improvements on the other metrics are also significant, e.g., 3-4% on AP_50 and 4-6% on AP_L. In particular, the stricter criterion AP_75 is boosted by 4-5%, suggesting a stronger localization performance. More excitingly, ResNet-50 with our MRLAs outperforms ResNet-101 by 2% on these detectors. Even when employing stronger backbones and detectors, the gains of our MRLAs are still substantial, demonstrating that our layer-level context modeling is complementary to the capacity of current models. Remarkably, they surpass all other models by large margins. Though RLA_g and our MRLAs both strengthen layer interactions and thus bring the most performance gains, our layer attention outperforms the layer aggregation in RLA_g.

Comparisons Using Mask R-CNN. Table 3 shows that MRLAs stand out with remarkable improvements on all metrics. In particular, MRLA-light strikes a good balance between computational cost and notable gains. For example, it is superior to the RLA_g module and the GC block while using much lower model complexity. Even though BA-Net ('+ BA') utilizes better pre-trained weights obtained from more advanced ImageNet training settings, our approach still outperforms it in these tasks. In summary, Tables 2 and 3 demonstrate that MRLAs generalize well to various tasks, among which they bring extraordinary benefits to dense prediction tasks. We make a reasonable conjecture that low-level features with positional information from local receptive fields are better preserved through layer interactions, leading to these notable improvements (see Figures 6 and 7 in the Appendix).

5.3. ABLATION STUDY

Different Variants of MRLA. Due to limited resources, we experiment with the ResNet-50 model on ImageNet. We first compare the multi-head layer attention (MLA) in Eq. (4) and MRLA-base in (a). Then we ablate the main components of MRLA-light to further identify their effects: (b) entirely without previous information; (c) without the 3x3 DWConv; (d) replacing the 1D convolutions with fully-connected (FC) layers; (e) with the identity connection, i.e., $\lambda^t_o = \mathbf{1}$; (g) adding a 3x3 DWConv to each layer of the baseline model. We also compare different hyperparameters: (f) different $d_k$, including the special case of channel-wise recurrent layer attention (CRLA), i.e., $d_k = 1$. Comparing (a) and MRLA-light validates our approximation in Eq. (7). Since the 3x3 DWConv mainly controls the additional computation cost, we compare with inserting a 3x3 DWConv layer after each layer of the original ResNet-50. Comparing (c) and (g) with ours shows that the improvement in accuracy is not fully caused by increasing model complexity. Then the necessity of retrieving previous layers' information and introducing $\lambda^t_o$ is investigated. (b) indicates that strengthening layers' interdependencies can boost performance notably, while (e) shows that MRLA plays a much more influential role than a simple residual connection. Using the FC layer in (d) may be unnecessary because it is comparable to Conv1D but introduces more parameters. (f) shows that our MRLA with different hyperparameters is still superior to the original network.

6. CONCLUSION AND FUTURE WORK

This paper focuses on strengthening layer interactions by retrospectively retrieving information via attention in deep neural networks. To this end, we propose a multi-head recurrent layer attention mechanism and its light-weighted version. Their potential has been well demonstrated by applications to mainstream CNN and transformer models on benchmark datasets. Remarkably, MRLA exhibits a good generalization ability on object detection and instance segmentation tasks. Our first future work is to consider a comprehensive hyperparameter tuning and more designs that are still worth trying on CNNs and vision transformers, such as attempting other settings for the number of layer attention heads in each stage and using other transformation functions. Moreover, in terms of MRLA design at a higher level, it is also an interesting direction to adapt other linearized attention mechanisms to a stack of layers.

Reproducibility Statement. To supplement more explanations and experiments and ensure reproducibility, we include the schematic diagram of multi-head layer attention, the detailed inner structure design and pseudo codes of MRLA-base and MRLA-light in CNNs and vision transformers in Appendix A. Besides, more implementation details about baseline models with our MRLAs are documented in Appendix B.

A.2 DETAILS OF THE SIMPLIFICATION AND THE APPROXIMATION

Simplification in Eq. (5). Recall that in the vanilla self-attention, $X \in \mathbb{R}^{T \times D_{in}}$ is the input matrix, where $T$ is the number of tokens and $D_{in}$ is the feature dimension. The vanilla self-attention derives the key and value matrices $K \in \mathbb{R}^{T \times D_k}$ and $V \in \mathbb{R}^{T \times D_{out}}$ by applying the transformation functions $F_K$ and $F_V$, respectively:
$$K = F_K(X) = F_K(\text{Concat}[X_{1,:}, X_{2,:}, ..., X_{T,:}]) \quad \text{and} \quad V = F_V(X) = F_V(\text{Concat}[X_{1,:}, X_{2,:}, ..., X_{T,:}]). \quad (9)$$
Importantly, both transformation functions are linear projections, i.e., $F_K(X) = XW_K$ and $F_V(X) = XW_V$. They only operate on the feature dimension $D_{in}$, and thus
$$K = \text{Concat}[F_K(X_{1,:}), ..., F_K(X_{T,:})] \quad \text{and} \quad V = \text{Concat}[F_V(X_{1,:}), ..., F_V(X_{T,:})]. \quad (10)$$
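The row-wise decomposition in Eq. (10) is a direct consequence of linearity and can be confirmed in two lines of NumPy (random matrices used purely for illustration):

```python
import numpy as np

# Because F_K is a linear projection acting on the feature dimension,
# projecting the whole matrix equals concatenating row-wise projections.
rng = np.random.default_rng(6)
T, D_in, D_k = 4, 6, 3
X = rng.standard_normal((T, D_in))
W_K = rng.standard_normal((D_in, D_k))

K_whole = X @ W_K
K_rows = np.concatenate([X[i:i+1] @ W_K for i in range(T)], axis=0)
assert np.allclose(K_whole, K_rows)
```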
Recall that in layer attention, we assume $D_{in} = D_{out} = D$ and denote the output feature of the $t$-th layer as $X^t \in \mathbb{R}^{1 \times D}$. In the $t$-th layer attention, we derive the key and value matrices $K^t \in \mathbb{R}^{t \times D_k}$ and $V^t \in \mathbb{R}^{t \times D}$ by applying the transformation functions $f^t_K$ and $f^t_V$ as in Eq. (3): $K^t = \text{Concat}[f^t_K(X^1), ..., f^t_K(X^t)]$ and $V^t = \text{Concat}[f^t_V(X^1), ..., f^t_V(X^t)]$. In other words, we directly borrow the formulation in Eq. (10) from the vanilla attention. Compared with Eq. (5), i.e., $K^t = \text{Concat}[f^1_K(X^1), ..., f^t_K(X^t)]$ and $V^t = \text{Concat}[f^1_V(X^1), ..., f^t_V(X^t)]$, Eq. (3) has a larger computation complexity. It is then natural to use the same transformation functions to avoid the redundancy caused by repeatedly deriving the keys and values for the same layer with different transformation functions.

Approximation in Eq. (8). Denote the resulting matrix of $(K^{t-1})^\top V^{t-1}$ by $B^{t-1} \in \mathbb{R}^{D_k \times D}$, with $b^{t-1}_j$ being its $j$-th row vector. If the non-linear softmax function on the projection of query and key is omitted, then the first approximation in Eq. (7) is as follows:
$$Q^t (K^{t-1})^\top V^{t-1} = (\lambda^t_q \odot Q^{t-1})[(K^{t-1})^\top V^{t-1}] \quad (11)$$
$$= (\lambda^t_q \odot Q^{t-1}) B^{t-1} \quad (12)$$
$$= \sum_{j=1}^{D_k} \lambda^t_{q,j} q^{t-1}_j b^{t-1}_j \quad (13)$$
$$\approx c \sum_{j=1}^{D_k} q^{t-1}_j b^{t-1}_j \quad (14)$$
$$= c\,[Q^{t-1}(K^{t-1})^\top V^{t-1}], \quad (15)$$
where $\lambda^t_{q,j}$ and $q^{t-1}_j$ in (13) are the $j$-th elements of $\lambda^t_q \in \mathbb{R}^{1 \times D_k}$ and $Q^{t-1} \in \mathbb{R}^{1 \times D_k}$, respectively. We approximate (13) with (14) by assuming all the elements of $\lambda^t_q$ are the same. As mentioned in Section 3.4, the proposed multi-head design relaxes this condition by only requiring all the elements of $\lambda^t_{q,h} \in \mathbb{R}^{1 \times D_k/H}$ to be the same. We then generalize (15) to $\lambda^t_o \odot [Q^{t-1}(K^{t-1})^\top V^{t-1}]$ and set $\lambda^t_o$ as learnable. This injects more flexibility along the dimension $D$ and alleviates the effects of the simplification in (14).
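Steps (11)-(15) can be illustrated numerically: when every element of $\lambda_q$ equals a constant $c$, the approximation in (14) is exact (random $Q^{t-1}$ and $B^{t-1}$ below are illustrative):

```python
import numpy as np

# When all elements of lambda_q equal c, (lambda_q ⊙ Q^{t-1}) B^{t-1}
# collapses exactly to c [Q^{t-1} B^{t-1}], i.e. (13) = (14) = (15).
rng = np.random.default_rng(7)
D_k, D = 4, 6
Q_prev = rng.standard_normal((1, D_k))
B_prev = rng.standard_normal((D_k, D))       # B^{t-1} = (K^{t-1})^T V^{t-1}
c = 0.7
lam_q = np.full((1, D_k), c)

lhs = (lam_q * Q_prev) @ B_prev              # steps (11)-(13)
rhs = c * (Q_prev @ B_prev)                  # steps (14)-(15)
assert np.allclose(lhs, rhs)
```

With unequal elements of $\lambda_q$ the two sides generally differ, which is what motivates the learnable $\lambda_o$ and the multi-head relaxation.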

A.3 ANOTHER LINEARIZATION TECHNIQUE ON RLA

In addition to using a learnable parameter to bridge the previous layer attention output with the current one in Eq. (8), we have tried another technique to linearize the computation of layer attention. We can first rewrite Eq. (6) with the softmax and the scale factor √D_k as follows:

O^t = Σ_{s=1}^t k(Q^t, K^t_{s,:}) V^t_{s,:} / Σ_{s=1}^t k(Q^t, K^t_{s,:}),

where the kernel function is k(x, y) = exp(x y^T / √D_k) for row vectors x and y. Assume that k(x, y) can be approximated by another kernel with feature representation ϕ(·), that is, k(x, y) = E[ϕ(x) ϕ(y)^T] (Choromanski et al., 2021). Then, the t-th layer attention output can be represented as:

O^t = Σ_{s=1}^t ϕ(Q^t) ϕ(K^t_{s,:})^T V^t_{s,:} / Σ_{s=1}^t ϕ(Q^t) ϕ(K^t_{s,:})^T
    = ϕ(Q^t) [Σ_{s=1}^t ϕ(K^t_{s,:})^T V^t_{s,:}] / ϕ(Q^t) [Σ_{s=1}^t ϕ(K^t_{s,:})^T].

The last equality holds because of the associative property of matrix multiplication. Adopting the linearization technique proposed by Katharopoulos et al. (2020), we introduce two variables,

U^t = Σ_{s=1}^t ϕ(K^t_{s,:})^T V^t_{s,:} and Z^t = Σ_{s=1}^t ϕ(K^t_{s,:})^T,

and simplify the computation as O^t = ϕ(Q^t) U^t / ϕ(Q^t) Z^t. It is worth noting that U^t and Z^t can be computed from U^{t-1} and Z^{t-1} by

U^t = U^{t-1} + ϕ(K^t_{t,:})^T V^t_{t,:} and Z^t = Z^{t-1} + ϕ(K^t_{t,:})^T.

This version of recurrent layer attention also has a linear computation complexity with respect to the network depth. However, compared with Eq. (8), this attempt suffers from higher memory costs and lower inference speed, and its multi-head implementation also performs worse than our proposed MRLA-base/light in experiments. Therefore, we prefer weighting the previous layer attention output with a learnable parameter, which is a more efficient way to strengthen layer interactions in practice.
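For illustration, the recurrence above can be checked numerically. The toy numpy sketch below uses an assumed positive feature map ϕ (a shifted ReLU, chosen only for demonstration) and keys shared across steps for simplicity; it confirms that the recurrent updates of U^t and Z^t reproduce the direct quadratic-cost sums at every step:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D_k, D = 5, 4, 6

phi = lambda x: np.maximum(x, 0) + 1e-6   # toy positive feature map φ(·)

Q = rng.standard_normal((T, D_k))  # query of each layer t (rows)
K = rng.standard_normal((T, D_k))  # keys K_{s,:}, shared across steps here
V = rng.standard_normal((T, D))    # values V_{s,:}

U = np.zeros((D_k, D))   # U^t = Σ_s φ(K_{s,:})^T V_{s,:}
Z = np.zeros((D_k, 1))   # Z^t = Σ_s φ(K_{s,:})^T
outs = []
for t in range(T):
    U += phi(K[t:t + 1]).T @ V[t:t + 1]
    Z += phi(K[t:t + 1]).T
    outs.append((phi(Q[t:t + 1]) @ U) / (phi(Q[t:t + 1]) @ Z))

# The recurrent result matches the direct (quadratic) computation at every t.
for t in range(T):
    num = sum(phi(Q[t:t + 1]) @ phi(K[s:s + 1]).T @ V[s:s + 1] for s in range(t + 1))
    den = sum(phi(Q[t:t + 1]) @ phi(K[s:s + 1]).T for s in range(t + 1))
    assert np.allclose(outs[t], num / den)
```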

A.4 MRLA-LIGHT IS CONSISTENT WITH THE BROAD DEFINITION OF ATTENTION

In Eq. (8), we have shown that O^t = Σ_{l=0}^{t-1} β_l ⊙ Q^{t-l} (K^{t-l}_{t-l,:})^T V^{t-l}_{t-l,:}, which is a weighted average of past layers' information V^{t-l}_{t-l,:}. This is consistent with the broad definition of attention used by many other attention mechanisms in computer vision. More concretely, existing channel or spatial attention mechanisms (SE (Hu et al., 2018), CBAM (Woo et al., 2018) and ECA (Wang et al., 2020b)) obtain the weights (similarities) of channels or pixels via learnable parameters (SE and ECA use two FC layers and a 1D convolution, respectively, while CBAM adopts a 2D convolution), and then scale the channels or pixels by these weights. This broader definition of attention relaxes the requirement on the weights, allowing them not to fully depend on X but to be freely learnable. Motivated by the above, the extended version of layer attention (MRLA-light) also scales different layers and takes their weighted average. It is worth mentioning that MRLA-light differs from SE in that SE recalibrates channels within a layer, while MRLA-light makes adjustments across layers.

Algorithm 1 MRLA-base in CNNs
1: Input: Output of the t-th CNN block X^t ∈ R^{1×H×W×C}, the (t-1)-th MRLA-base block's key and value K^{t-1} ∈ R^{1×(t-1)×C} and V^{t-1} ∈ R^{1×(t-1)×H×W×C}, number of heads H
2: Output: O^t ∈ R^{1×H×W×C}, K^t ∈ R^{1×t×C}, V^t ∈ R^{1×t×H×W×C}
3: // Summarize X^t's spatial information
4: Y^t ← GAP(X^t) ∈ R^{1×1×C}
5: // Derive the current layer's query, key and value via convolutions
6: Q^t ← Conv1D(Y^t) ∈ R^{1×1×C}
7: K^t_{t,:} ← Conv1D(Y^t) ∈ R^{1×1×C}
8: V^t_{t,:} ← DWConv2D(X^t) ∈ R^{1×1×H×W×C}
9: if t = 1 then
10:   // First MRLA-base block
11:   K^t ← K^t_{t,:}
12:   V^t ← V^t_{t,:}
13: else
14:   // Concatenate with the previous key and value
15:   K^t ← Concat[K^{t-1}, K^t_{t,:}] ∈ R^{1×t×C}
16:   V^t ← Concat[V^{t-1}, V^t_{t,:}] ∈ R^{1×t×H×W×C}
17: end if
18: O^t ← Multi-head Attention(Q^t, K^t, V^t) with H heads
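A compact numpy sketch of Algorithm 1, with simple stand-ins of our own for the learned operators (a dense matrix for each Conv1D and a per-channel scale for DWConv2D; these substitutions are for illustration only and are not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, C, heads = 2, 2, 8, 2      # spatial size, channels, attention heads
d = C // heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mrla_base_step(X_t, K_cache, V_cache, Wq, Wk, w_dw):
    """One MRLA-base block: X_t is (H, W, C); caches grow by one layer per call."""
    y = X_t.mean(axis=(0, 1))                          # GAP -> (C,)
    q, k = y @ Wq, y @ Wk                              # stand-ins for the Conv1D layers
    v = X_t * w_dw                                     # stand-in for DWConv2D (per-channel scale)
    K = k[None] if K_cache is None else np.vstack([K_cache, k[None]])              # (t, C)
    V = v[None] if V_cache is None else np.concatenate([V_cache, v[None]], axis=0)  # (t, H, W, C)
    t = K.shape[0]
    O = np.empty((H, W, C))
    for h in range(heads):                             # attention over the t layers, per head
        sl = slice(h * d, (h + 1) * d)
        scores = softmax(q[sl] @ K[:, sl].T / np.sqrt(d))   # (t,) weights over layers
        O[..., sl] = np.tensordot(scores, V[..., sl], axes=1)
    return O, K, V

Wq, Wk = rng.standard_normal((C, C)), rng.standard_normal((C, C))
w_dw = rng.standard_normal(C)
K_cache = V_cache = None
for t in range(3):                                     # stack three "layers"
    X_t = rng.standard_normal((H, W, C))
    O, K_cache, V_cache = mrla_base_step(X_t, K_cache, V_cache, Wq, Wk, w_dw)
```

Note how the key and value caches grow by one entry per layer, which is exactly the source of the quadratic cost that MRLA-light removes.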

B EXPERIMENTS

Due to the limited space in the main paper, we provide more experimental settings and discussions in this section, organized as follows. We first provide the implementation details of ImageNet classification and more comparisons with other SOTA networks in Appendix B.1. We also compare the model complexity and memory cost of MLA and MRLAs in Appendix B.2. Next, some implementation details and results of object detection and instance segmentation on COCO are included in Appendix B.3. Then the ablation study of a vision transformer (DeiT) is given in Appendix B.4.

Published as a conference paper at ICLR 2023

Algorithm 2 MRLA-base in Vision Transformers
1: Input: Output of the t-th transformer block X^t ∈ R^{1×(N+1)×C}, the (t-1)-th MRLA-base block's key and value K^{t-1} ∈ R^{1×(t-1)×C} and V^{t-1} ∈ R^{1×(t-1)×√N×√N×C}, number of heads H
2: Output: O^t ∈ R^{1×(N+1)×C}, K^t ∈ R^{1×t×C}, V^t ∈ R^{1×t×√N×√N×C}
3: // Split into the class token and patch tokens
4: X^t_c ∈ R^{1×1×C}, X^t_p ∈ R^{1×N×C} ← Split(X^t)
5: // Reshape
6: X^t_p ← Reshape(X^t_p) ∈ R^{1×√N×√N×C}
7: // Summarize X^t_p's spatial information
8: Y^t_p ← GAP(X^t_p) ∈ R^{1×1×C}
9: // Derive the current layer's query, key and value via convolutions
10: Q^t ← Conv1D(Y^t_p) ∈ R^{1×1×C}
11: K^t_{t,:} ← Conv1D(Y^t_p) ∈ R^{1×1×C}
12: V^t_{t,:} ← DWConv2D(X^t_p) ∈ R^{1×1×√N×√N×C}
13: if t = 1 then
14:   // First MRLA-base block
15:   K^t ← K^t_{t,:}
16:   V^t ← V^t_{t,:}
17: else
18:   // Concatenate with the previous key and value
19:   K^t ← Concat[K^{t-1}, K^t_{t,:}] ∈ R^{1×t×C}
20:   V^t ← Concat[V^{t-1}, V^t_{t,:}] ∈ R^{1×t×√N×√N×C}
21: end if
22: O^t_p ← Multi-head Attention(Q^t, K^t, V^t) with H heads
23: // Reshape to the original dimension
24: O^t_p ← Reshape(O^t_p) ∈ R^{1×N×C}
25: O^t ← Concat[X^t_c, O^t_p]

Algorithm 3 MRLA-light in CNNs
1: Input: Output of the t-th CNN block X^t ∈ R^{1×H×W×C}, the (t-1)-th MRLA-light block's output O^{t-1} ∈ R^{1×H×W×C}, number of heads H, learnable parameter λ^t_o ∈ R^{1×C}
2: Output: O^t ∈ R^{1×H×W×C}
3: // Summarize X^t's spatial information
4: Y^t ← GAP(X^t) ∈ R^{1×1×C}
5: // Derive the current layer's query, key and value via convolutions
6: Q^t ← Conv1D(Y^t) ∈ R^{1×1×C}
7: K^t_{t,:} ← Conv1D(Y^t) ∈ R^{1×1×C}
8: V^t_{t,:} ← DWConv2D(X^t) ∈ R^{1×1×H×W×C}
9: Õ^t ← Multi-head Attention(Q^t, K^t_{t,:}, V^t_{t,:}) with H heads
10: O^t ← Expand(λ^t_o) ⊙ O^{t-1} + Õ^t

Algorithm 4 MRLA-light in Vision Transformers
1: Input: Output of the t-th transformer block X^t ∈ R^{1×(N+1)×C}, the (t-1)-th MRLA-light block's output O^{t-1} ∈ R^{1×(N+1)×C}, number of heads H, learnable parameter λ^t_o ∈ R^{1×C}
2: Output: O^t ∈ R^{1×(N+1)×C}
3: // Split into the class token and patch tokens
4: X^t_c ∈ R^{1×1×C}, X^t_p ∈ R^{1×N×C} ← Split(X^t)
5: O^{t-1}_c ∈ R^{1×1×C}, O^{t-1}_p ∈ R^{1×N×C} ← Split(O^{t-1})
6: // Reshape
7: X^t_p ← Reshape(X^t_p) ∈ R^{1×√N×√N×C}
8: // Summarize X^t_p's spatial information
9: Y^t_p ← GAP(X^t_p) ∈ R^{1×1×C}
10: // Derive the current layer's query, key and value via convolutions

The networks are trained from scratch using SGD with momentum of 0.9, weight decay of 1e-4, and a mini-batch size of 256. The models are trained for 100 epochs, with the initial learning rate set to 0.1 and decreased by a factor of 10 every 30 epochs. Since the data augmentation and training settings used in the original ResNet are outdated and not as powerful as those used by other networks, strengthening layer interactions leads to overfitting on ResNet. Pretraining on a larger dataset and using extra training tricks can be an option; however, as most of our baseline models and the above attention models are not pretrained on larger datasets, these measures would result in an unfair comparison. Hence we use a more efficient strategy: applying stochastic depth (Huang et al., 2016) with survival probability of 0.8 only on our MRLA, the effects of which will be discussed in the ablation study later.
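The final step of Algorithm 3 is the recurrence O^t = Expand(λ^t_o) ⊙ O^{t-1} + Õ^t; unrolling it shows that the output is a λ-weighted sum of all previous layers' attention terms, which is what makes MRLA-light linear in depth. A small numpy check (channel-wise λ with arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
T, C = 4, 6
lam = rng.uniform(0.5, 1.0, size=(T, C))    # per-layer learnable λ^t_o (channel-wise)
tilde_O = rng.standard_normal((T, C))       # per-layer attention outputs Õ^t

# Recurrence: O^t = λ^t_o ⊙ O^{t-1} + Õ^t, with O^0 = 0
O = np.zeros(C)
for t in range(T):
    O = lam[t] * O + tilde_O[t]

# Unrolled form: the final output is a λ-weighted sum of all layers' Õ^s
# (the weight of layer s is the product of the λ's of all later layers).
O_unrolled = sum(np.prod(lam[s + 1:], axis=0) * tilde_O[s] for s in range(T))
assert np.allclose(O, O_unrolled)
```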
11: Q^t ← Conv1D(Y^t_p) ∈ R^{1×1×C}
12: K^t_{t,:} ← Conv1D(Y^t_p) ∈ R^{1×1×C}
13: V^t_{t,:} ← GELU(DWConv2D(X^t_p)) ∈ R^{1×1×√N×√N×C}
14: Õ^t_p ← Multi-head Attention(Q^t, K^t_{t,:}, V^t_{t,:}) with H heads
15: O^t_p ← Expand(λ^t_o) ⊙ O^{t-1}_p + Õ^t_p
16: // Reshape O^t_p back to R^{1×N×C} and concatenate with the class token as in Algorithm 2

EfficientNet  For training EfficientNet with our MRLA, we follow the settings in EfficientNet. Specifically, the networks are trained for 350 epochs using the RMSProp optimizer with momentum of 0.9, decay of 0.9, batch-norm momentum of 0.99, weight decay of 4e-5, and a mini-batch size of 4096. The initial learning rate is set to 0.256 and is decayed by 0.97 every 2.4 epochs. Since our computational resources can hardly support the original batch size, we linearly scale the batch size and the initial learning rate down to 768 and 0.048, respectively. We also use AutoAugment (Cubuk et al., 2019), stochastic depth (Huang et al., 2016) with survival probability 0.8, and dropout (Srivastava et al., 2014) with ratio 0.2 for EfficientNet-B0 and EfficientNet-B1. Our MRLA shares the same stochastic depth as the building layer of EfficientNet, since it is natural to drop the MRLA block if the corresponding layer is dropped. We implement EfficientNets with these training tricks using the pytorch-image-models (timm) toolkit (Wightman, 2019) on 2 V100 GPUs.

DeiT, CeiT and PVTv2  We adopt the same training and augmentation strategy as that in DeiT. All models are trained for 300 epochs using the AdamW optimizer with weight decay of 0.05. We use the cosine learning rate schedule and set the initial learning rate to 0.001 with a batch size of 1024. Five epochs are used to gradually warm up the learning rate at the beginning of training.
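The batch-size reduction above for EfficientNet follows the usual linear learning-rate scaling rule, which keeps the ratio of learning rate to batch size fixed. A quick sanity check of the numbers reported in the text:

```python
# Linear learning-rate scaling: keep lr / batch_size constant when the
# batch size must be reduced (EfficientNet settings from the text).
base_lr, base_bs = 0.256, 4096
new_bs = 768
new_lr = base_lr * new_bs / base_bs
assert abs(new_lr - 0.048) < 1e-12
```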
We apply RandAugment (Cubuk et al., 2020), repeated augmentation (Hoffer et al., 2020), label smoothing (Szegedy et al., 2016) with ϵ = 0.1, Mixup (Zhang et al., 2017) with 0.8 probability, Cutmix (Yun et al., 2019) with 1.0 probability, and random erasing (Zhong et al., 2020).

Other Relevant Networks  TDAM (Jaiswal et al., 2022) and BA-Net (Zhao et al., 2022) adopted implementation settings different from ours when training on ImageNet. For the baseline model, we used the results from the torchvision toolkit, while TDAM utilized those from pytorch-image-models (timm). Note that the latter implementation includes advanced design settings (e.g., three 3x3 Convs instead of a 7x7 Conv) and training tricks (e.g., cosine learning rate schedule and label smoothing) to improve the performance of ResNets, and BA-Net applied the cosine learning rate schedule and label smoothing in its training process. Therefore, it would be unfair to directly compare TDAM, BA-Net and MRLAs using the current results. We reproduced the performance of BA-Net with our training settings and trained our model with the settings used in BA-Net. The results are given in Table 6. We also compare the performances in object detection and instance segmentation tasks, although we do not train our model using the better pre-trained weights.

The detectors are implemented with the MMDetection toolkit (Chen et al., 2019). Specifically, the shorter side of input images is resized to 800, and all detectors are optimized using SGD with weight decay of 1e-4, momentum of 0.9, and batch size of 16. The learning rate is initialized to 0.02 and is decreased by a factor of 10 after 8 and 11 epochs, respectively, i.e., the 1x training schedule (12 epochs). For RetinaNet, we modify the initial learning rate to 0.01 to avoid training problems. Since the models no longer suffer from overfitting in these transfer learning tasks, we remove the stochastic depth used on our MRLA in ImageNet classification.

Results

Complete results of Mask R-CNN on object detection and instance segmentation are shown in Tables 8 and 9 for comprehensive comparison. We observe that our method brings clear improvements over the original ResNet on all evaluation metrics. Compared with the two channel attention methods SE and ECA, our MRLA achieves larger gains for small objects, which are usually more difficult to detect. Interestingly, even though BA-Net utilizes the better pre-trained weights (the model with the † superscript in Table 6) in object detection and instance segmentation, our approach still outperforms the BA-Net counterpart on most of the metrics. Besides the comparisons with various attention methods on the same baseline models, we also compare our model with different types of networks, as shown in Table 10.

B.4 DISCUSSION ON ABLATION STUDY

MLA and MRLA-base are both effective.  Before introducing MRLA-light, we also proposed MLA (the multi-head version of Eq. (4) in Section 3.2) and MRLA-base (Eq. (9) in Section 3.4), both of which rigorously follow the definition of self-attention in the Transformer (Vaswani et al., 2017). From Tables 1 and 4, we can observe that MLA and MRLA-base perform better than most other methods. Besides, there is only a negligible gap between their performance and that of MRLA-light in some cases, which can be attributed to the following reasons:
• We directly applied MRLA-light's hyper-parameter settings and design to the other two without further tuning.
• Benefiting from a simpler architecture and the learnable λ^t_o, MRLA-light is easier to train and more flexible.
Therefore, if we do not chase the minimum computation, time and memory cost, MLA and MRLA-base are also good choices, as they can enrich the representation power of a network as well as MRLA-light does.

Convolutions in Transformers  The improvement of our MRLA over transformers is not entirely caused by the convolution, and the experiments in Table 11 support this point:
• We inserted a DWConv layer into DeiT, a convolution-free transformer. The result demonstrates that our MRLA outperforms adding the DWConv layer.
• We also applied our MRLA to some convolutional transformers, e.g., CeiT (Yuan et al., 2021) and PVTv2 (Wang et al., 2022a), and find that our MRLA can further boost their performances.

Stochastic Depth  Stochastic depth is not the fundamental component that helps MRLA outperform its counterparts. Instead, it is a tool to avoid overfitting (too-high training accuracy) on the middle-sized ImageNet-1K dataset. We have found that MLA and MRLA-base also suffer from the overfitting problem, though less severely than MRLA-light.
As stated in Section 1, strengthening layer interactions can improve model performance. We thus conjecture that the information from previous layers brought by our MRLAs is too strong, leading to the overfitting. The detailed justifications are as follows:
• Applying stochastic depth with the same survival probability to the "+ DWConv2d" and ECA modules does not bring significant improvements (see Table 12 (a) and (c)), showing that stochastic depth itself is not the key to boosting the model performance.
• It is natural to share the same stochastic depth on EfficientNet and vision transformers, since layer attention should not be applied if that layer is dropped.

To investigate how MRLAs contribute to representation learning in CNNs, we visualize the feature maps with the score-weighted visual explanations yielded by Score-CAM (Wang et al., 2020a) in Figure 6. Specifically, we extract the feature maps from the end of each stage in ResNet-50 and our MRLA counterparts. We display the visualizations for Stages 3 and 4, as the feature maps from the first two stages are quite similar across all models and focus on low-level features. The two example images are randomly selected from the ImageNet validation set. In the visualizations, areas with warmer colors contribute more to the classification. We can observe that: (1) The models with MRLAs tend to find the critical areas faster than the baseline model; especially in Stage 3, the MRLAs have already moved to emphasize the high-level features while the baseline model still focuses on the lower-level ones. (2) The red areas in the ResNet-50 + MRLA-base/light models are larger than those in the baseline model, implying that the MRLA counterparts utilize more information for the final decision-making. (3) The patterns of MRLA-base and MRLA-light are similar, validating that our approximation in MRLA-light does not sacrifice much of its ability.
Figure 7 visualizes the attention maps of a specified query (red box) from three randomly chosen heads in the last layer of DeiT-T and our MRLA counterpart. The first image is randomly sampled from the ImageNet validation set, and the second image is downloaded from a website. In the visualization, areas with warmer colors have higher attention scores. We can observe that MRLA helps the network retrieve more task-related local details than the baseline model; in other words, the low-level features are better preserved with layer attention.



License: Apache License 2.0
https://github.com/luo3300612/Visualizer



Figure 1: (a) Visualization of the layer attention scores from a randomly chosen head of MRLA in each stage of the ResNet-50+MRLA model; (b) Schematic diagram of two consecutive layers with RLA.

Figure 2: (a) Detailed operations in MRLA-light block with feature dimensions; (b) Absolute cosine similarity between queries from MRLA-base blocks of two consecutive layers.

Figure 4 illustrates the detailed operations in the MRLA-base block with feature dimensions. Most of the inner structure designs are similar to those of the MRLA-light block, except that the previous keys and values are concatenated instead of adding the output of the previous MRLA block.

Figure 4: Detailed operations in MRLA-base block with feature dimensions.

Figure 5 visualizes the FLOPs induced by MRLA-light with respect to the input resolution. We compute the FLOPs of the baseline CeiT-T and our MRLA-light counterpart and then take their difference under various input resolutions. It can be observed that the complexity of MRLA-light is linear in the input resolution.

Figure 5: The FLOPs induced by MRLA-light with respect to input resolution.

Figure 6: Visualizations of the feature maps extracted from the end of Stage 3 and 4 of ResNet-50 and our MRLA counterparts.

Figure 7: Visualizations of the attention maps in the last layer of DeiT-T and our MRLA counterpart given a query specified in the red box.

Comparisons of single-crop accuracy on the ImageNet-1K validation set. † means the results are from torchvision toolkit. The bold fonts denote the best performances.

Object detection results of different methods on COCO val2017. FLOPs are calculated on 1280 × 800 input. The bold fonts denote the best performances.

Object detection and instance segmentation results of different methods using Mask R-CNN as a framework on COCO val2017. AP bb and AP m denote AP of bounding box and mask.

Ablation study on different variants using ResNet-50 as the baseline model.

ResNet  For training ResNets with our MRLA, we follow exactly the same data augmentation and hyper-parameter settings as in the original ResNet. Specifically, the input images are randomly cropped to 224 × 224 with random horizontal flipping.

Complete results of Mask R-CNN on object detection using different methods. The bold fonts denote the best performances.

Complete results of Mask R-CNN on instance segmentation using different methods. The bold fonts denote the best performances.

Object detection results with different backbones using RetinaNet as a framework on COCO val2017. All models are trained in "1x" schedule. FLOPs are calculated on 1280 × 800 input. The blue bold fonts denote the best performances, while the bold ones perform comparably.

APPENDIX

This appendix contains supplementary explanations and experiments to support the proposed layer attention and multi-head recurrent layer attention (MRLA). Appendix A supplements the arguments made in Section 3, including clarifications on layer attention in Eq. (4), recurrent layer attention in Eqs. (6) and (8), and another linearization of recurrent layer attention with the design of Katharopoulos et al. (2020). The pseudo codes of the two modules are also attached. Appendix B provides more experimental details, results, explanations for the ablation study, and visualizations of MRLA-base and MRLA-light in CNNs and vision transformers. Besides, we also attempt to support our motivation through experimental results.

A MULTI-HEAD RECURRENT LAYER ATTENTION

A.1 discusses the differences between layer attention and other work related to layer interaction. A.2 elaborates on the simplification and approximation made in MRLA-base and MRLA-light. Our attempt to linearize RLA with the recurrent method proposed in Katharopoulos et al. (2020), briefly mentioned in Section 3, is included in A.3. A.4 explains more about the broad definition of attention. A.5 illustrates the detailed block design of MRLA-base and provides the pseudo codes of both modules.

A.1 LAYER INTERACTION RELATED WORK

Apart from OmniNet (Tay et al., 2021), which we elaborate on in Section 1, we would like to further compare our layer attention with other layer-interaction related work, including DenseNet (Huang et al., 2017), DIANet (Huang et al., 2020) and RLA^g-Net (Zhao et al., 2021). The empirical studies comparing them are included in Appendix B.1.3.

Layer Attention  Figure 3 illustrates the layer attention proposed in Section 3.2. Here a residual layer refers to a CNN block or a Transformer block (including the multi-head attention layers and the feed-forward layer). For each layer, the feature maps of all preceding layers are used as the inputs of its layer attention, and its own feature maps are in turn reused by all subsequent layer attention, which intrinsically has quadratic complexity.

Comparison with DenseNet, DIANet, and RLA^g-Net  Though DenseNet also extracts all previous layers' features and concatenates them before the Conv1x1, it is not an attention-based method. Moreover, it aggregates all previous layers' features in a channel-wise way, regardless of which layer a feature comes from; it can thus be viewed as channel attention across layers. Instead, our layer attention models the features from the same layer in the same way via attention. RLA^g-Net does not use attention either. Finally, DIANet emphasizes the channel-wise recalibration of the current layer's output, which is intrinsically channel attention rather than layer attention.

A.2 DETAILS ABOUT SIMPLIFICATION IN EQ. (5) AND APPROXIMATION IN EQ. (8)

Simplification in Eq. (5)  This simplification is mild and reasonable. In the vanilla self-attention, suppose we have X ∈ R^{T×D_in}, where T stands for the number of tokens and D_in is the input

B.2 OUT-OF-MEMORY PROBLEM AND THE NECESSITY OF MRLA-LIGHT

Though it is possible to train a naive quadratic version with MLA (Eq. (4)) and MRLA-base (Eq. (6)) on ResNet-50 and obtain good results, it leads to an out-of-memory (OOM) problem on ResNet-101 if we keep the same batch size of 256. Here, we compare the parameters, FLOPs, and memory cost per GPU (all models are trained on 4 V100 GPUs) of MLA, MRLA-base and MRLA-light to support our claim.
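The OOM problem comes from the growing key/value caches: at layer t, MRLA-base stores t value tensors, so the total storage across a stack of T layers grows quadratically, while MRLA-light keeps only the previous output. A back-of-the-envelope count (T is an arbitrary illustrative depth):

```python
# Rough count of cached value tensors across a stack of T layers:
# MRLA-base keeps all previous layers' values at layer t (t entries),
# while MRLA-light keeps only the previous output (1 entry).
T = 50                                           # illustrative network depth
base_entries = sum(t for t in range(1, T + 1))   # 1 + 2 + ... + T
light_entries = T * 1

assert base_entries == T * (T + 1) // 2          # quadratic in depth
assert light_entries == T                        # linear in depth
```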

Implementation details

We adopt the commonly used settings (Hu et al., 2018; Wang et al., 2018; Cao et al., 2019; Wang et al., 2020b; Zhao et al., 2021), which are the default settings in MMDetection.
• For object detection and instance segmentation on COCO, we did not observe any overfitting problem when removing the stochastic depth trick. We speculate that the 12-epoch training leads to underfitting, since we adopt the standard 1x training schedule. Therefore, there is no need to use this trick for these two tasks.
• There are indeed other solutions to address the overfitting problem, but they are sub-optimal.
  - We prefer using stochastic depth over pretraining on larger datasets because of limited computational resources and time: the ImageNet-22K (14M) and JFT-300M (300M) datasets are significantly larger than ImageNet-1K (1.28M). Besides, choosing this more efficient strategy allows a fair comparison with current SOTA attention models, as most of them are not pretrained on these larger datasets.
  - Mixup augmentation (M) and label smoothing (LS) were also tried to prevent overfitting. Using them simultaneously achieves performance similar to stochastic depth (see Table 12 (b)). However, these methods influence the entire network instead of only our MRLA, leading to unfair comparisons with other models.
  - Manually applying MRLA to a subset of layers instead of using stochastic depth is also feasible. However, it may cost much more time to decide at which layers MRLA should be dropped.
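Applying stochastic depth only to the MRLA branch can be sketched as follows; `stochastic_depth_branch` is a hypothetical helper of ours (not from the paper's code), assuming the common convention of dropping the branch during training and scaling it by the survival probability at inference:

```python
import random

def stochastic_depth_branch(branch_out, identity, survival_prob=0.8, training=True):
    """Stochastic depth on a single branch: during training the branch is
    dropped with probability 1 - survival_prob (only the identity path is
    kept); at inference the branch output is scaled by survival_prob."""
    if training:
        if random.random() < survival_prob:
            return identity + branch_out
        return identity
    return identity + survival_prob * branch_out

# Toy usage: drop only the (stand-in) MRLA refinement, never the backbone path.
x = 1.0                      # stand-in for the block input
mrla_term = 0.5              # stand-in for the MRLA output
random.seed(0)
y_train = stochastic_depth_branch(mrla_term, x, training=True)
y_eval = stochastic_depth_branch(mrla_term, x, training=False)
assert y_eval == 1.0 + 0.8 * 0.5
```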

