CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER ATTENTION

Abstract

More and more evidence has shown that strengthening layer interactions can enhance the representation power of a deep neural network, while self-attention excels at learning interdependencies by retrieving query-activated information. Motivated by this, we devise a cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), which sends a query representation of the current layer to all previous layers to retrieve query-related information from different levels of receptive fields. A lightweight version of MRLA is also proposed to reduce the quadratic computation cost. The proposed layer attention mechanism can enrich the representation power of many state-of-the-art vision networks, including CNNs and vision transformers. Its effectiveness has been extensively evaluated in image classification, object detection and instance segmentation tasks, where consistent improvements are observed. For example, MRLA improves the Top-1 accuracy of ResNet-50 by 1.6% while introducing only 0.16M parameters and 0.07B FLOPs. Surprisingly, it boosts performance by a large margin of 3-4% box AP and mask AP in dense prediction tasks. Our code is available at https://github.com/joyfang1106/MRLA.

1. INTRODUCTION

Growing evidence indicates that strengthening layer interactions can encourage the information flow of a deep neural network (He et al., 2016; Huang et al., 2017; Zhao et al., 2021). For example, in vision networks, the receptive fields are usually enlarged as layers are stacked. These hierarchical receptive fields play different roles in extracting features: local texture features are captured by small receptive fields, while global semantic features are captured by large receptive fields. Hence, encouraging layer interactions can enhance the representation power of networks by combining different levels of features. Previous empirical studies also support the necessity of building interdependencies across layers. ResNet (He et al., 2016) proposed to add a skip connection between two consecutive layers. DenseNet (Huang et al., 2017) further reinforced layer interactions by making each layer accessible to all subsequent layers within a stage. Recently, GLOM (Hinton, 2021) adopted an intensely interacted architecture that includes bottom-up, top-down, and same-level interactions, attempting to represent part-whole hierarchies in a neural network.

In the meantime, the attention mechanism has proven itself in learning interdependencies by retrieving query-activated information in deep neural networks. Current works on attention lay much emphasis on amplifying interactions within a layer (Hu et al., 2018; Woo et al., 2018; Dosovitskiy et al., 2021). They implement attention on channels, spatial locations, and patches; however, none of them considers attention on layers, which are actually the higher-level features of a network. It is then natural to ask: "Can attention replicate its success in strengthening layer interactions?"

This paper gives a positive answer. Specifically, starting from the vanilla attention, we first give a formal definition of layer attention. Under this definition, a query representation of the current layer is sent to all previous layers to retrieve related information from hierarchical receptive fields. The resulting attention scores concretely depict the cross-layer dependencies, and they also quantify the importance of hierarchical information to the query layer. Furthermore, utilizing the sequential structure of networks, we suggest a way to perform layer attention recurrently in Section 3.3 and call it recurrent layer attention (RLA). A multi-head design is naturally introduced to form representation subspaces, which yields multi-head RLA (MRLA). Figure 1(a) visualizes the layer attention scores yielded by MRLA in Eq. (6). Interestingly, most layers pay more attention to the first layer within the stage, verifying our motivation for retrospectively retrieving information.

Inheriting from the vanilla attention, MRLA has a quadratic complexity of O(T^2), where T is the depth of a network. When applied to very deep networks, this incurs a high computation cost and possibly an out-of-memory problem. To mitigate these issues, this paper devises a lightweight version of MRLA with linear complexity of O(T). After imposing a linearized approximation, MRLA becomes more efficient and has a broader sphere of applications.
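To make the idea concrete, below is a minimal PyTorch sketch of cross-layer attention in the above spirit: the current layer issues a query, keys and values are drawn from all layers seen so far, and attention is taken over the layer (depth) dimension, so a network of depth T pays the quadratic O(T^2) cost discussed above. The pooling-based query/key, the 1x1-convolution value, and all names (NaiveLayerAttention, dim_k, history) are illustrative assumptions for exposition, not the exact multi-head, recurrent MRLA block defined later in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveLayerAttention(nn.Module):
    """Toy cross-layer attention: the current layer's output queries the
    outputs of all layers seen so far (quadratic cost over the depth).
    The pooled query/key, 1x1-conv value and all names are illustrative
    assumptions, not the paper's exact MRLA block."""

    def __init__(self, channels, dim_k=64):
        super().__init__()
        self.to_q = nn.Linear(channels, dim_k)   # query from the current layer
        self.to_k = nn.Linear(channels, dim_k)   # key from each previous layer
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)  # value maps
        self.scale = dim_k ** -0.5

    def forward(self, history):
        # history: list [x_1, ..., x_t] of feature maps, each (B, C, H, W)
        # and of identical shape (e.g. the layers within one ResNet stage).
        x_t = history[-1]
        b = x_t.size(0)
        q = self.to_q(x_t.mean(dim=(2, 3)))                        # (B, d_k)
        k = torch.stack([self.to_k(x.mean(dim=(2, 3))) for x in history], dim=1)  # (B, t, d_k)
        v = torch.stack([self.to_v(x) for x in history], dim=1)    # (B, t, C, H, W)
        # One score per previous layer: these are the layer attention scores.
        attn = F.softmax((k @ q.unsqueeze(-1)).squeeze(-1) * self.scale, dim=1)   # (B, t)
        # Weighted sum of the value maps over the layer dimension.
        return (attn.view(b, -1, 1, 1, 1) * v).sum(dim=1)          # (B, C, H, W)

A backbone could maintain the list of previous outputs and fuse the retrieved features back into the main path, e.g. with a residual connection such as x = block(x); history.append(x); x = x + layer_attn(history). The lightweight MRLA instead performs this retrieval recurrently with a linearized approximation (Section 3.3), which removes the explicit loop over all previous layers and reduces the cost to linear in the depth.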
To the best of our knowledge, our work is the first attempt to systematically study cross-layer dependencies via attention. It differs from the information aggregation in DenseNet because the latter aggregates all previous layers' features in a channel-wise way, regardless of which layer a feature comes from. OmniNet (Tay et al., 2021) and ACLA (Wang et al., 2022b) follow the same spirit as DenseNet: they allow each token from each layer to attend to all tokens from all previous layers. Essentially, both of them neglect the layer identity of each token. By contrast, we build on layer attention to retrospectively retrieve query-related features from previous layers. Besides, to bypass the high computation cost, OmniNet divides the network into several partitions and inserts the omnidirectional attention block only after the last layer of each partition, while ACLA samples tokens with gates from each layer. Instead, our lightweight version of MRLA can easily be applied to each layer.

Both versions of MRLA can improve many state-of-the-art (SOTA) vision networks, such as convolutional neural networks (CNNs) and vision transformers. We have conducted extensive experiments across various tasks, including image classification, object detection and instance segmentation. The experimental results show that our MRLA performs favorably against its counterparts. Especially in dense prediction tasks, it outperforms other SOTA networks by a large margin. The visualizations (see Appendix B.5) show that our MRLA can retrieve local texture features with positional information from previous layers, which may account for its remarkable success in dense prediction.

The main contributions of this paper are summarized below:
(1) A novel layer attention mechanism, MRLA, is proposed to strengthen cross-layer interactions by retrieving query-related information from previous layers.
(2) A lightweight version of MRLA with linear complexity is further devised to make cross-layer attention feasible for more deep networks.
(3) We show that MRLA is compatible with many networks and validate its effectiveness across a broad range of tasks on benchmark datasets.
(4) We investigate the important design elements of the MRLA block through an ablation study and provide guidelines for its application to convolutional and transformer-based vision models.

2. RELATED WORK

Layer Interaction. Apart from the works mentioned above, other CNN-based and transformer-based models also put much effort into strengthening layer interactions. DIANet (Huang et al., 2020) utilized a parameter-sharing LSTM along the network depth to model the cross-channel relationships with the help of previous layers' information. CN-CNN (Guo et al., 2022) combined DIANet's LSTM with spatial and channel attention for feature fusion across layers. A similar RNN module was applied



Figure 1: (a) Visualization of the layer attention scores from a randomly chosen head of MRLA in each stage of the ResNet-50+MRLA model; (b) schematic diagram of two consecutive layers with RLA.

